[jira] [Commented] (SPARK-32965) pyspark reading csv files with utf_16le encoding
[ https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200578#comment-17200578 ]

Takeshi Yamamuro commented on SPARK-32965:
------------------------------------------

Is this issue almost the same as SPARK-32961?

> pyspark reading csv files with utf_16le encoding
> ------------------------------------------------
>
>                 Key: SPARK-32965
>                 URL: https://issues.apache.org/jira/browse/SPARK-32965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
>
> If you have a file encoded in utf_16le or utf_16be and try to use
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't rendered properly.
> If you use python decoding like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x: x.decode("utf_16le").splitlines())
> and then do spark.read.csv(prdd), then it works.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
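[Editor's note] The reporter's workaround can be illustrated outside Spark. This is a minimal plain-Python sketch (no Spark involved) of why the manual decode works: UTF-16LE bytes only become usable CSV lines once the right codec is named explicitly, which is exactly what the `flatMap(lambda x: x.decode("utf_16le").splitlines())` step in the report does.

```python
# A tiny CSV payload encoded the way the reporter's files are.
data = "name,val\nbar,5\n".encode("utf_16_le")

# Decoding with the correct codec recovers the CSV lines; feeding the raw
# bytes to a reader that assumes another encoding yields garbled columns.
lines = data.decode("utf_16_le").splitlines()
print(lines)  # ['name,val', 'bar,5']
```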
[jira] [Resolved] (SPARK-32959) Fix the "Relation: view text" test in DataSourceV2SQLSuite
[ https://issues.apache.org/jira/browse/SPARK-32959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-32959.
---------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29811
[https://github.com/apache/spark/pull/29811]

> Fix the "Relation: view text" test in DataSourceV2SQLSuite
> ----------------------------------------------------------
>
>                 Key: SPARK-32959
>                 URL: https://issues.apache.org/jira/browse/SPARK-32959
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Terry Kim
>            Assignee: Terry Kim
>            Priority: Minor
>             Fix For: 3.1.0
>
> The existing code just defines a function literal and doesn't execute it:
> {code:java}
> test("Relation: view text") {
>   val t1 = "testcat.ns1.ns2.tbl"
>   withTable(t1) {
>     withView("view1") { v1: String =>
>       sql(s"CREATE TABLE $t1 USING foo AS SELECT id, data FROM source")
>       sql(s"CREATE VIEW $v1 AS SELECT * from $t1")
>       checkAnswer(sql(s"TABLE $v1"), spark.table("source"))
>     }
>   }
> }
> {code}
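[Editor's note] The bug class here is general: constructing a function value is not the same as executing it, so a test body wrapped in an unused lambda silently passes. A minimal plain-Python sketch (hypothetical helper, not the Scala test suite) of the same pitfall:

```python
executed = []

def with_view(name):
    # Returns a runner that actually applies `body`; a caller who merely
    # builds a lambda and never hands it over gets zero execution.
    def run(body):
        executed.append(name)
        return body(name)
    return run

# Broken pattern: the function literal is created but never invoked,
# so none of its assertions run and the "test" trivially passes.
checks = lambda v: executed.append("checked " + v)
assert executed == []

# Fixed pattern: pass the body to the helper so it is actually executed.
with_view("view1")(checks)
assert executed == ["view1", "checked view1"]
```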
[jira] [Assigned] (SPARK-32959) Fix the "Relation: view text" test in DataSourceV2SQLSuite
[ https://issues.apache.org/jira/browse/SPARK-32959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-32959:
-----------------------------------
    Assignee: Terry Kim

> Fix the "Relation: view text" test in DataSourceV2SQLSuite
> ----------------------------------------------------------
>
>                 Key: SPARK-32959
>                 URL: https://issues.apache.org/jira/browse/SPARK-32959
[jira] [Updated] (SPARK-32966) Spark| PartitionBy is taking long time to process
[ https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujit Das updated SPARK-32966:
------------------------------
    Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5  (was: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5)

> Spark| PartitionBy is taking long time to process
> -------------------------------------------------
>
>                 Key: SPARK-32966
>                 URL: https://issues.apache.org/jira/browse/SPARK-32966
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5
>            Reporter: Sujit Das
>            Priority: Major
>              Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partition, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took longer (more than 50 min before I force-terminated the EMR cluster). I observed that the partitions had been created and the data files were present, but in the EMR cluster the process still showed as running, whereas the Spark history server showed no running or pending process:
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 3. I then set spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 4. Again I disabled the conf and ran a plain write with partitioning. It took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all the scenarios above is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My point is to reduce the time of writing with partitionBy. Is there anything I am missing?
[jira] [Commented] (SPARK-32966) Spark| PartitionBy is taking long time to process
[ https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200571#comment-17200571 ]

Takeshi Yamamuro commented on SPARK-32966:
------------------------------------------

Is this a question? At least, I think you need to describe more info (e.g., a complete query to reproduce the issue).

> Spark| PartitionBy is taking long time to process
> -------------------------------------------------
>
>                 Key: SPARK-32966
>                 URL: https://issues.apache.org/jira/browse/SPARK-32966
[jira] [Resolved] (SPARK-32966) Spark| PartitionBy is taking long time to process
[ https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-32966.
--------------------------------------
    Resolution: Invalid

> Spark| PartitionBy is taking long time to process
> -------------------------------------------------
>
>                 Key: SPARK-32966
>                 URL: https://issues.apache.org/jira/browse/SPARK-32966
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200564#comment-17200564 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

Thank you.

> `approx_percentile` in Spark SQL gives incorrect results
> --------------------------------------------------------
>
>                 Key: SPARK-32306
>                 URL: https://issues.apache.org/jira/browse/SPARK-32306
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark, SQL
>    Affects Versions: 2.4.4, 3.0.0, 3.1.0
>            Reporter: Sean Malory
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.1.0
>
> The `approx_percentile` function in Spark SQL does not give the correct result. I'm not sure how incorrect it is; it may just be a boundary issue. From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5, although I suspect this is an issue with the underlying algorithm.
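[Editor's note] One reason the result above is arguably a boundary issue rather than a wrong answer: a percentile estimator that always returns an element of the input (as `percentile_approx` does) can only answer 5 or 8 for the median of {5, 8}, never the interpolated 6.5. A hedged plain-Python sketch using the nearest-rank definition (this is not Spark's actual algorithm, just an illustration of the element-returning behaviour):

```python
import math

def nearest_rank_percentile(values, p):
    # Nearest-rank definition: return the element whose sorted rank is
    # ceil(p * n). Like percentile_approx, this always yields a member of
    # the input, never an interpolated value such as 6.5.
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

print(nearest_rank_percentile([5, 8], 0.5))  # 5, matching the reported result
```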
[jira] [Updated] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-32961:
-------------------------------------
    Component/s:     (was: Spark Core)
                 SQL

> PySpark CSV read with UTF-16 encoding is not working correctly
> --------------------------------------------------------------
>
>                 Key: SPARK-32961
>                 URL: https://issues.apache.org/jira/browse/SPARK-32961
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.1
>         Environment: both spark local and cluster mode
>            Reporter: Bui Bao Anh
>            Priority: Major
>              Labels: Correctness
>         Attachments: pandas df.png, pyspark df.png, sendo_sample.csv
>
> There are weird characters in the output when printing to the console or writing to files. See the attached files for how it looks in a Spark Dataframe and a Pandas Dataframe.
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200551#comment-17200551 ]

Takeshi Yamamuro commented on SPARK-32961:
------------------------------------------

cc: [~yumwang]

> PySpark CSV read with UTF-16 encoding is not working correctly
> --------------------------------------------------------------
>
>                 Key: SPARK-32961
>                 URL: https://issues.apache.org/jira/browse/SPARK-32961
[jira] [Updated] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-32778:
-------------------------------------
    Issue Type: Improvement  (was: Bug)

> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Aman Rastogi
>            Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present at the path "/already/existing/path". This happened because the table was not yet present in the Hive metastore, but the given path had data. If the table is not present in the Hive metastore, the SaveMode gets modified internally to SaveMode.Overwrite irrespective of what the user has provided, which leads to data deletion. This change was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583.
> Now suppose the user is not using an external Hive metastore (the metastore is associated with a cluster) and the cluster goes down, or for some reason the user has to migrate to a new cluster. Once the user tries to save data using the code above in the new cluster, it will first delete the data. It could be production data, and the user is completely unaware of the deletion, since they provided SaveMode.Append or ErrorIfExists. This would be an accidental data deletion.
>
> Repro steps:
> 1. Save data through a Hive table as in the code above.
> 2. Create another cluster and save data into a new table in the new cluster, giving the same path.
>
> Proposed fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to delete data that already exists, as the overwrite could be accidental.
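[Editor's note] The failure mode and the proposed fix can be sketched with a toy model of SaveMode semantics. This is hypothetical illustrative code (a dict standing in for the table path, not Spark's API), showing why silently rewriting the user's mode to overwrite destroys pre-existing data, while error-if-exists surfaces the conflict instead:

```python
# A dict standing in for the storage layer, keyed by table path.
warehouse = {"/already/existing/path": ["production rows"]}

def save(path, rows, mode):
    # Simplified SaveMode semantics: overwrite clobbers, ignore no-ops,
    # errorifexists refuses, append falls through to extend.
    if path in warehouse:
        if mode == "errorifexists":
            raise FileExistsError(path)
        if mode == "ignore":
            return
        if mode == "overwrite":
            warehouse[path] = []      # the old data is gone at this point
    warehouse.setdefault(path, []).extend(rows)

# The user asked for append, but the command internally switched the mode
# to overwrite because the table was absent from the metastore:
save("/already/existing/path", ["new rows"], "overwrite")
assert warehouse["/already/existing/path"] == ["new rows"]  # old rows deleted

# The proposed fix fails loudly instead of deleting:
warehouse["/already/existing/path"] = ["production rows"]
try:
    save("/already/existing/path", ["new rows"], "errorifexists")
except FileExistsError as e:
    print("refused to clobber", e)

assert warehouse["/already/existing/path"] == ["production rows"]  # intact
```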
[jira] [Resolved] (SPARK-31618) Pushdown Distinct through Join in IntersectDistinct based on stats
[ https://issues.apache.org/jira/browse/SPARK-31618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-31618.
--------------------------------------
    Resolution: Won't Fix

I'll close this because the corresponding PR has been closed.

> Pushdown Distinct through Join in IntersectDistinct based on stats
> ------------------------------------------------------------------
>
>                 Key: SPARK-31618
>                 URL: https://issues.apache.org/jira/browse/SPARK-31618
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 3.0.0
>            Reporter: Prakhar Jain
>            Priority: Major
[jira] [Resolved] (SPARK-32870) Make sure that all expressions have their ExpressionDescription properly filled
[ https://issues.apache.org/jira/browse/SPARK-32870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-32870.
--------------------------------------
    Fix Version/s: 3.1.0
         Assignee: Tanel Kiis
       Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29743

> Make sure that all expressions have their ExpressionDescription properly filled
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-32870
>                 URL: https://issues.apache.org/jira/browse/SPARK-32870
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, SQL
>    Affects Versions: 3.1.0
>            Reporter: Tanel Kiis
>            Assignee: Tanel Kiis
>            Priority: Major
>             Fix For: 3.1.0
>
> Make sure that all SQL expressions have their usage, examples and since filled.
[jira] [Updated] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bui Bao Anh updated SPARK-32961:
--------------------------------
    Attachment: sendo_sample.csv

> PySpark CSV read with UTF-16 encoding is not working correctly
> --------------------------------------------------------------
>
>                 Key: SPARK-32961
>                 URL: https://issues.apache.org/jira/browse/SPARK-32961
[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557 ]

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:
----------------------------------------------------------------------------

I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]
Have this PR: https://github.com/apache/spark/pull/29844

was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]

> Driver and executors use a different service account breaking pull secrets
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27872
>                 URL: https://issues.apache.org/jira/browse/SPARK-27872
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.3, 3.0.0
>            Reporter: Stavros Kontopoulos
>            Assignee: Stavros Kontopoulos
>            Priority: Major
>             Fix For: 3.0.0
>
> Driver and executors use different service accounts when the driver has one set up that is different from the default:
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service account with a pull secret:
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]
> Executors will not use the driver's service account and will not be able to get the secret in order to pull the related image.
> I am not sure what the assumption is for using the default account for executors; probably the fact that this account is limited (executors don't create resources)? This is an inconsistency that could be worked around with the pod template feature in Spark 3.0.0, but it breaks pull secrets, and in general I think it's a bug to have it.
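[Editor's note] The breakage can be seen from the setup the linked Kubernetes docs describe. A pull secret attached to a service account only applies to pods running as that account, so if executors fall back to `default` while the driver uses a dedicated account, the secret never reaches the executor pods. A sketch of such a service account (names hypothetical):

```yaml
# Pull secret attached to the *driver's* service account only; executor
# pods running as "default" never receive it and fail to pull the image.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-driver-sa
imagePullSecrets:
  - name: registry-pull-secret
```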
[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557 ]

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:
----------------------------------------------------------------------------

I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]
[|https://github.com/apache/spark/pull/29844]

was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]
Have this PR: https://github.com/apache/spark/pull/29844

> Driver and executors use a different service account breaking pull secrets
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27872
>                 URL: https://issues.apache.org/jira/browse/SPARK-27872
[jira] [Issue Comment Deleted] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neelesh Srinivas Salian updated SPARK-27872:
--------------------------------------------
    Comment: was deleted

(was: I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje] [|https://github.com/apache/spark/pull/29844])

> Driver and executors use a different service account breaking pull secrets
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27872
>                 URL: https://issues.apache.org/jira/browse/SPARK-27872
[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557 ]

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:
----------------------------------------------------------------------------

I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]
[|https://github.com/apache/spark/pull/29844]

was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. Should I add it here or a new cloned issue? [~eje]
[|https://github.com/apache/spark/pull/29844]

> Driver and executors use a different service account breaking pull secrets
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27872
>                 URL: https://issues.apache.org/jira/browse/SPARK-27872
[jira] [Resolved] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI
[ https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32017.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29703
[https://github.com/apache/spark/pull/29703]

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --------------------------------------------------
>
>                 Key: SPARK-32017
>                 URL: https://issues.apache.org/jira/browse/SPARK-32017
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: George Pongracz
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.1.0
>
> The version of Pyspark 3.0.0 currently available in PyPI uses Hadoop 2.7.4. Could a variant (or the default) have its version of Hadoop aligned to 3.2.0, as per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session-token authorisations and assist in accessing data residing in object stores with stronger encryption methods.
> If not PyPI, then as a tar file in the Apache download archives at the least, please.
[jira] [Assigned] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI
[ https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32017:
------------------------------------
    Assignee: Hyukjin Kwon

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --------------------------------------------------
>
>                 Key: SPARK-32017
>                 URL: https://issues.apache.org/jira/browse/SPARK-32017
[jira] [Resolved] (SPARK-32933) Use keyword-only syntax for keyword_only methods
[ https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32933.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29799
[https://github.com/apache/spark/pull/29799]

> Use keyword-only syntax for keyword_only methods
> ------------------------------------------------
>
>                 Key: SPARK-32933
>                 URL: https://issues.apache.org/jira/browse/SPARK-32933
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Minor
>             Fix For: 3.1.0
>
> Since Python 3.0, the language provides syntax for indicating keyword-only arguments ([PEP 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}
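[Editor's note] The bare `*` in PEP 3102 makes the keyword-only expectation enforceable by the interpreter itself, not just by a decorator. A minimal sketch with a hypothetical class mirroring the snippet in the description (not Spark's actual transformer):

```python
class PolyExpansion:
    # Everything after the bare * must be passed by keyword (PEP 3102).
    def __init__(self, *, degree=2, inputCol=None, outputCol=None):
        self.degree = degree
        self.inputCol = inputCol
        self.outputCol = outputCol

PolyExpansion(degree=3, inputCol="features")   # fine: keywords only

try:
    PolyExpansion(3, "features")               # positional use now fails fast
except TypeError as e:
    print("rejected:", e)                      # raises TypeError
```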
[jira] [Assigned] (SPARK-32933) Use keyword-only syntax for keyword_only methods
[ https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32933:
------------------------------------
    Assignee: Maciej Szymkiewicz

> Use keyword-only syntax for keyword_only methods
> ------------------------------------------------
>
>                 Key: SPARK-32933
>                 URL: https://issues.apache.org/jira/browse/SPARK-32933
[jira] [Commented] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200439#comment-17200439 ] Apache Spark commented on SPARK-27872: -- User 'nssalian' has created a pull request for this issue: https://github.com/apache/spark/pull/29844 > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.3, 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > Driver and executors use different service accounts in case the driver has > one set up which is different than default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what is the assumption here for using the default account for > executors, probably because of the fact that this account is limited (btw > executors dont create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0 but it breaks pull > secrets and in general I think its a bug to have it. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
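[Editor's note] A sketch of the failure mode in SPARK-27872, following the linked Kubernetes docs: the pull secret is attached to the driver's service account, but executor pods run under the namespace's `default` account and so cannot pull the image. One possible workaround, until driver and executors share an account, is to attach the secret to the account the executors actually use. All names below are hypothetical:

```yaml
# Hypothetical sketch: attach the registry pull secret to the service
# account the executor pods actually run under (the namespace default),
# since they do not inherit the driver's service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: spark-jobs        # hypothetical namespace
imagePullSecrets:
- name: my-registry-secret     # hypothetical secret name
```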
[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-17556: --- Assignee: (was: L. C. Hsieh) > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin >Priority: Major > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-17556: Comment: was deleted (was: We will recently try to pick this up again.) > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin >Priority: Major > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32932: --- Description: With AQE, local shuffle reader breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} -Coalescing shuffle partitions could also break it.- was: With AQE, local shuffle reader breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} Coalescing shuffle partitions could also break it. > AQE local shuffle reader breaks repartitioning for dynamic partition overwrite > -- > > Key: SPARK-32932 > URL: https://issues.apache.org/jira/browse/SPARK-32932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Major > > With AQE, local shuffle reader breaks users' repartitioning for dynamic > partition overwrite as in the following case. 
> {code:java} > test("repartition with local reader") { > withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> > PartitionOverwriteMode.DYNAMIC.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "5", > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { > withTable("t") { > val data = for ( > i <- 1 to 10; > j <- 1 to 3 > ) yield (i, j) > data.toDF("a", "b") > .repartition($"b") > .write > .partitionBy("b") > .mode("overwrite") > .saveAsTable("t") > assert(spark.read.table("t").inputFiles.length == 3) > } > } > }{code} > -Coalescing shuffle partitions could also break it.- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
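[Editor's note] As a hedged sketch, not taken from the ticket itself: in Spark 3.0 the local shuffle reader can be disabled independently of AQE, so the behavior above can be probed (or worked around) with settings along these lines:

```properties
# Possible workaround sketch (untested here): keep AQE on but disable the
# local shuffle reader so a user-requested repartition($"b") survives,
# while still using dynamic partition overwrite.
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.localShuffleReader.enabled   false
spark.sql.sources.partitionOverwriteMode        dynamic
```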
[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file
[ https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200382#comment-17200382 ] Chen Zhang commented on SPARK-32956: Okay, I will submit a PR later. > Duplicate Columns in a csv file > --- > > Key: SPARK-32956 > URL: https://issues.apache.org/jira/browse/SPARK-32956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Priority: Major > > Imagine a csv file shaped like: > > Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price > 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728" > 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644" > = > Reading this with the header=True will result in a stacktrace. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
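[Editor's note] The duplicate-header failure in SPARK-32956 can be reproduced outside Spark, and one common mitigation, shown here as a plain-Python sketch rather than Spark's actual behavior, is to disambiguate repeated header names before handing the file to `spark.read.csv`:

```python
import csv
import io

# The header row from the report, with Sale_Amount appearing twice.
raw = """Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
"""

def dedupe(names):
    # Suffix repeated column names with an index so each is unique,
    # e.g. the second Sale_Amount becomes Sale_Amount_1.
    seen = {}
    out = []
    for n in names:
        k = seen.get(n, 0)
        out.append(n if k == 0 else f"{n}_{k}")
        seen[n] = k + 1
    return out

reader = csv.reader(io.StringIO(raw))
header = dedupe(next(reader))
print(header)
```

A caller could then pass the deduplicated names as an explicit schema (or rename columns after load) instead of relying on `header=True`.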
[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200366#comment-17200366 ] Apache Spark commented on SPARK-29250: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/29843 > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29250: Assignee: Apache Spark > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29250: Assignee: (was: Apache Spark) > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32306: Issue Type: Documentation (was: Bug) > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Sean Malory >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200337#comment-17200337 ] L. C. Hsieh commented on SPARK-32306: - Resolved by https://github.com/apache/spark/pull/29835. > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32306: Affects Version/s: 3.1.0 3.0.0 > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Sean Malory >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-32306. - Fix Version/s: 3.1.0 Assignee: Maxim Gekk Resolution: Fixed > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
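[Editor's note] Why an approximate percentile can report 5 as the "median" of 5 and 8: sketch-based implementations return an actual element of the input at (roughly) the requested rank rather than interpolating between values. A minimal pure-Python illustration of that rank-based definition follows; it is not Spark's actual algorithm, which is why the ticket was reclassified as a documentation issue:

```python
def percentile_lower_rank(values, p):
    # Lower-rank percentile: return the element at index floor(p * (n - 1)).
    # Because the result must be an element of the input, the "median"
    # of two values is one of them, never their average.
    s = sorted(values)
    return s[int(p * (len(s) - 1))]

print(percentile_lower_rank([5, 8], 0.5))   # 5, matching the report
```

An exact, interpolating median of {5, 8} would be 6.5; the rank-based definition above can only ever answer 5 or 8.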
[jira] [Commented] (SPARK-32019) Add spark.sql.files.minPartitionNum config
[ https://issues.apache.org/jira/browse/SPARK-32019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200327#comment-17200327 ] Apache Spark commented on SPARK-32019: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29842 > Add spark.sql.files.minPartitionNum config > -- > > Key: SPARK-32019 > URL: https://issues.apache.org/jira/browse/SPARK-32019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019
[ https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32970: Assignee: (was: Apache Spark) > Reduce the runtime of unit test for SPARK-32019 > --- > > Key: SPARK-32970 > URL: https://issues.apache.org/jira/browse/SPARK-32970 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: Test > > The UT for SPARK-32019 can run over 7 minutes on Jenkins. > This sort of simple UT should run in a few seconds - definitely less than a > minute. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019
[ https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200323#comment-17200323 ] Apache Spark commented on SPARK-32970: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29842 > Reduce the runtime of unit test for SPARK-32019 > --- > > Key: SPARK-32970 > URL: https://issues.apache.org/jira/browse/SPARK-32970 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: Test > > The UT for SPARK-32019 can run over 7 minutes on Jenkins. > This sort of simple UT should run in a few seconds - definitely less than a > minute. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019
[ https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32970: Assignee: Apache Spark > Reduce the runtime of unit test for SPARK-32019 > --- > > Key: SPARK-32970 > URL: https://issues.apache.org/jira/browse/SPARK-32970 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > Labels: Test > > The UT for SPARK-32019 can run over 7 minutes on Jenkins. > This sort of simple UT should run in a few seconds - definitely less than a > minute. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200307#comment-17200307 ] Xinli Shang commented on SPARK-27733: - We talked about the Parquet 1.11.0 adoption in Spark in today's Parquet community sync meeting. The Parquet community would like to help if there is any way to move faster. [~csun][~smilegator][~dongjoon][~iemejia] and others, are you interested in joining our next Parquet meeting to brainstorm solutions to move forward? > Upgrade to Avro 1.10.0 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.2 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranamer, no shaded guava, security > updates, so probably a worth upgrade. > Avro 1.10.0 was released and this is still not done. > There is at the moment (2020/08) still a blocker because of Hive related > transitive dependencies bringing older versions of Avro, so we could say that > this is somehow still blocked until HIVE-21737 is solved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019
Tanel Kiis created SPARK-32970: -- Summary: Reduce the runtime of unit test for SPARK-32019 Key: SPARK-32970 URL: https://issues.apache.org/jira/browse/SPARK-32970 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.1.0 Reporter: Tanel Kiis The UT for SPARK-32019 can run over 7 minutes on Jenkins. This sort of simple UT should run in a few seconds - definitely less than a minute. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32969) Spark Submit process not exiting after session.stop()
[ https://issues.apache.org/jira/browse/SPARK-32969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] El R updated SPARK-32969: - Affects Version/s: (was: 3.0.1) > Spark Submit process not exiting after session.stop() > - > > Key: SPARK-32969 > URL: https://issues.apache.org/jira/browse/SPARK-32969 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.4.7 >Reporter: El R >Priority: Critical > > Exactly 3 spark submit processes are hanging from the first 3 jobs that were > submitted to the standalone cluster using client mode. Example from the > client: > {code:java} > root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 > /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp > /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar > -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 > --conf spark.master=spark://3c520b0c6d6e:7077 --conf > spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml > --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e > --conf spark.fileserver.port=46102 --conf > packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf > spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf > spark.replClassServer.port=46104 --conf > spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf > spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf > spark.blockManager.port=46105 --conf 
spark.dynamicAllocation.enabled=true > pyspark-shell > root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 > /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp > /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar > -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 > --conf spark.master=spark://3c520b0c6d6e:7077 --conf > spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml > --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e > --conf spark.fileserver.port=46102 --conf > packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf > spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf > spark.replClassServer.port=46104 --conf > spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf > spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf > spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true > pyspark-shell > root 2239 65.3 7.8 9743456 2527236 ? 
Sl 19:10 91:30 > /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp > /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar > -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 > --conf spark.master=spark://3c520b0c6d6e:7077 --conf > spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml > --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e > --conf spark.fileserver.port=46102 --conf > packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf > spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf > spark.replClassServer.port=46104 --conf > spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf > spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True
[jira] [Created] (SPARK-32969) Spark Submit process not exiting after session.stop()
El R created SPARK-32969: Summary: Spark Submit process not exiting after session.stop() Key: SPARK-32969 URL: https://issues.apache.org/jira/browse/SPARK-32969 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 3.0.1, 2.4.7 Reporter: El R Exactly 3 spark submit processes are hanging from the first 3 jobs that were submitted to the standalone cluster using client mode. Example from the client: {code:java} root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell root 1746 0.4 3.5 8152640 1132420 ? 
Sl 18:59 0:36 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell root 2239 65.3 7.8 9743456 2527236 ? 
Sl 19:10 91:30 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell {code} The corresponding jobs are showing as 'completed' in spark UI and have closed their sessions & exited according to their logs. No worker resources are being consumed by these jobs anymore & subsequent jobs are able to receive
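A SparkSubmit process that lingers after session.stop() usually means something in the driver is still holding a live non-daemon thread. As a generic, Spark-agnostic diagnostic (the function name below is illustrative, not a Spark API), one can list the threads that would keep a Python driver process alive:

```python
import threading


def lingering_nondaemon_threads():
    """Names of live non-daemon threads other than the main thread.

    After session.stop() returns, anything listed here can keep the
    driver process -- and with it the client-mode spark-submit JVM --
    from exiting, even though the app shows as completed in the UI.
    """
    return [
        t.name
        for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon and t.is_alive()
    ]
```

If this list is non-empty right before the script ends, those threads are the suspects; as a blunt last resort, `os._exit(0)` terminates the process regardless of what is still running.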
[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200284#comment-17200284 ] Igor Kamyshnikov commented on SPARK-20525: -- I bet the issue is in JDK, but it could be solved in scala if they get rid of writeReplace/List$SerializationProxy. I've left some details [here|https://issues.apache.org/jira/browse/SPARK-19938?focusedCommentId=17200272=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17200272] in SPARK-19938. > ClassCast exception when interpreting UDFs from a String in spark-shell > --- > > Key: SPARK-20525 > URL: https://issues.apache.org/jira/browse/SPARK-20525 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 2.1.0 > Environment: OS X 10.11.6, spark-2.1.0-bin-hadoop2.7, Scala version > 2.11.8 (bundled w/ Spark), Java 1.8.0_121 >Reporter: Dave Knoester >Priority: Major > Labels: bulk-closed > Attachments: UdfTest.scala > > > I'm trying to interpret a string containing Scala code from inside a Spark > session. Everything is working fine, except for User Defined Function-like > things (UDFs, map, flatMap, etc). This is a blocker for production launch of > a large number of Spark jobs. > I've been able to boil the problem down to a number of spark-shell examples, > shown below. Because it's reproducible in the spark-shell, these related > issues **don't apply**: > https://issues.apache.org/jira/browse/SPARK-9219 > https://issues.apache.org/jira/browse/SPARK-18075 > https://issues.apache.org/jira/browse/SPARK-19938 > http://apache-spark-developers-list.1001551.n3.nabble.com/This-Exception-has-been-really-hard-to-trace-td19362.html > https://community.mapr.com/thread/21488-spark-error-scalacollectionseq-in-instance-of-orgapachesparkrddmappartitionsrdd > https://github.com/scala/bug/issues/9237 > Any help is appreciated! > > Repro: > Run each of the below from a spark-shell. 
> Preamble: > import scala.tools.nsc.GenericRunnerSettings > import scala.tools.nsc.interpreter.IMain > val settings = new GenericRunnerSettings( println _ ) > settings.usejavacp.value = true > val interpreter = new IMain(settings, new java.io.PrintWriter(System.out)) > interpreter.bind("spark", spark); > These work: > // works: > interpreter.interpret("val x = 5") > // works: > interpreter.interpret("import spark.implicits._\nval df = > spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show") > These do not work: > // doesn't work, fails with seq/RDD serialization error: > interpreter.interpret("import org.apache.spark.sql.functions._\nimport > spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF > = > udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\", > upperUDF($\"value\")).show") > // doesn't work, fails with seq/RDD serialization error: > interpreter.interpret("import org.apache.spark.sql.functions._\nimport > spark.implicits._\nval upper: String => String = > _.toUpperCase\nspark.udf.register(\"myUpper\", > upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\", > callUDF(\"myUpper\", ($\"value\"))).show") > The not-working ones fail with this exception: > Caused by: java.lang.ClassCastException: cannot assign instance of > scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013) > at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80) > at org.apache.spark.scheduler.Task.run(Task.scala:99) >
[jira] [Resolved] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32964. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29836 [https://github.com/apache/spark/pull/29836] > Pass all `streaming` module UTs in Scala 2.13 > - > > Key: SPARK-32964 > URL: https://issues.apache.org/jira/browse/SPARK-32964 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Spark Core >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.1.0 > > > There is only one failed case of `streaming` module in Scala 2.13: > * `start with non-serializable DStream checkpoint ` in StreamingContextSuite > StackOverflowError is thrown here when SerializationDebugger#visit method is > called. > The error stack as follow: > {code:java} > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownExpected exception > java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownScalaTestFailureLocation: > org.apache.spark.streaming.StreamingContextSuite at > (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrown at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) > at org.scalatest.Assertions.intercept(Assertions.scala:756) at > org.scalatest.Assertions.intercept$(Assertions.scala:746) at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at > org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) > ...Caused by: java.lang.StackOverflowError at > 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at > sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) > at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at > scala.collection.AbstractIterable.foreach(Iterable.scala:920) at > scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) > at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at >
[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32964: - Assignee: Yang Jie > Pass all `streaming` module UTs in Scala 2.13 > - > > Key: SPARK-32964 > URL: https://issues.apache.org/jira/browse/SPARK-32964 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Spark Core >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > There is only one failed case of `streaming` module in Scala 2.13: > * `start with non-serializable DStream checkpoint ` in StreamingContextSuite > StackOverflowError is thrown here when SerializationDebugger#visit method is > called. > The error stack as follow: > {code:java} > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownExpected exception > java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownScalaTestFailureLocation: > org.apache.spark.streaming.StreamingContextSuite at > (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrown at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) > at org.scalatest.Assertions.intercept(Assertions.scala:756) at > org.scalatest.Assertions.intercept$(Assertions.scala:746) at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at > org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) > ...Caused by: java.lang.StackOverflowError at > org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) 
at > sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) > at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at > scala.collection.AbstractIterable.foreach(Iterable.scala:920) at > scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) > at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) > at > 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at >
[jira] [Comment Edited] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field
[ https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200272#comment-17200272 ] Igor Kamyshnikov edited comment on SPARK-19938 at 9/22/20, 5:55 PM: [~rdblue], my analysis shows a different root cause of the problem: https://bugs.openjdk.java.net/browse/JDK-8024931 (never fixed) https://github.com/scala/bug/issues/9777 (asking Scala to solve it on their side) It's about circular references among the objects being serialized: RDD1.dependencies_ = Seq1[RDD2] RDD2.dependencies_ = Seq2[RDD3] RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2] The Seqs are instances of scala.collection.immutable.List, which uses writeReplace, giving an instance of 'SerializationProxy'. The serialization of RDD3 puts a reference to Seq1's SerializationProxy. During deserialization, that reference to the SerializationProxy is read before its 'readResolve' method is called (see the JDK bug reported). was (Author: kamyshnikov): [~rdblue], my analysis shows the different root cause of the problem: https://bugs.openjdk.java.net/browse/JDK-8024931 https://github.com/scala/bug/issues/9777 It's about circular references among the objects being serialized: RDD1.dependencies_ = Seq1[RDD2] RDD2.dependences_ = Seq2[RDD3] RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2] Seq are instances of scala.collection.immutable.List which uses writeReplace, giving an instance of 'SerializationProxy'. The serialization of RDD3 puts a reference to the Seq1's SerializationProxy. When the deserialization works, it reads that reference to SerializationProxy earlier than the 'readResolve' method is called (see the JDK bug reported). 
> java.lang.ClassCastException: cannot assign instance of > scala.collection.immutable.List$SerializationProxy to field > --- > > Key: SPARK-19938 > URL: https://issues.apache.org/jira/browse/SPARK-19938 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.0.2 >Reporter: srinivas thallam >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field
[ https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200272#comment-17200272 ] Igor Kamyshnikov commented on SPARK-19938: -- [~rdblue], my analysis shows a different root cause of the problem: https://bugs.openjdk.java.net/browse/JDK-8024931 https://github.com/scala/bug/issues/9777 It's about circular references among the objects being serialized: RDD1.dependencies_ = Seq1[RDD2] RDD2.dependencies_ = Seq2[RDD3] RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2] The Seqs are instances of scala.collection.immutable.List, which uses writeReplace, giving an instance of 'SerializationProxy'. The serialization of RDD3 puts a reference to Seq1's SerializationProxy. During deserialization, that reference to the SerializationProxy is read before its 'readResolve' method is called (see the JDK bug reported). > java.lang.ClassCastException: cannot assign instance of > scala.collection.immutable.List$SerializationProxy to field > --- > > Key: SPARK-19938 > URL: https://issues.apache.org/jira/browse/SPARK-19938 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.0.2 >Reporter: srinivas thallam >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
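The failure mode described in the comment above can be reproduced without Spark or Scala. The sketch below (class names are illustrative, not Spark's or Scala's) combines a writeReplace/readResolve serialization proxy with a cyclic reference: on the way back in, the back-reference resolves to the proxy before readResolve has run, so assigning it to a field of the original declared type fails with the same "cannot assign instance of ... to field ... of type ..." ClassCastException, as in JDK-8024931.

```java
import java.io.*;

public class ProxyCycleDemo {
    static class Node implements Serializable {
        MyList owner; // declared type is MyList, so a MyListProxy cannot be assigned here
    }

    static class MyList implements Serializable {
        Node elem;
        // Serialize through a proxy, as scala.collection.immutable.List does
        private Object writeReplace() { return new MyListProxy(this); }
    }

    static class MyListProxy implements Serializable {
        Node elem;
        MyListProxy(MyList l) { this.elem = l.elem; }
        // Rebuild the real list once the proxy itself has been deserialized
        private Object readResolve() {
            MyList l = new MyList();
            l.elem = elem;
            return l;
        }
    }

    /** Serialize a cyclic graph and report what happens on deserialization. */
    static String roundTrip() {
        try {
            MyList list = new MyList();
            Node node = new Node();
            list.elem = node;
            node.owner = list; // the cycle: list -> node -> back to the same list

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(list); // list is replaced by its proxy in the stream
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                ois.readObject(); // node.owner resolves to the proxy -> ClassCastException
            }
            return "ok";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Because the JDK memoizes the original-to-replacement mapping, the second encounter of the list writes a handle to the proxy, exactly as the comment describes for Seq1's SerializationProxy.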
[jira] [Updated] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32968: Description: We could do column pruning for CsvToStructs expression if we only require some fields from it. > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We could do column pruning for CsvToStructs expression if we only require > some fields from it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
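The pruning idea in the description can be sketched outside Spark in plain Python (names are illustrative; this is not Spark's implementation): when a query reads only some fields of the parsed struct, the parser need not convert the rest.

```python
def parse_pruned(line, header, required):
    """Convert only the fields a query actually reads -- the essence of
    column pruning for a from_csv-style expression."""
    values = line.split(",")
    index = {name: i for i, name in enumerate(header)}
    return {name: values[index[name]] for name in required}

# Only 'name' is required, so 'id' and 'score' are never materialized
row = parse_pruned("1,foo,2.5", ["id", "name", "score"], ["name"])
```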
[jira] [Created] (SPARK-32968) Column pruning for CsvToStructs
L. C. Hsieh created SPARK-32968: --- Summary: Column pruning for CsvToStructs Key: SPARK-32968 URL: https://issues.apache.org/jira/browse/SPARK-32968 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32967) Optimize csv expression chain
L. C. Hsieh created SPARK-32967: --- Summary: Optimize csv expression chain Key: SPARK-32967 URL: https://issues.apache.org/jira/browse/SPARK-32967 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Like json, we could do the same optimization to csv expression chain, e.g. from_csv + to_csv. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
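Why a from_csv + to_csv chain is removable can be shown with Python's standard csv module (an illustrative sketch, not Spark's implementation): under matching options, serializing what was just parsed reproduces the input, so `to_csv(from_csv(col))` can collapse to `col`.

```python
import csv
import io


def from_csv(line):
    """Parse one CSV line into its fields."""
    return next(csv.reader([line]))


def to_csv(fields):
    """Serialize fields back into one CSV line (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


line = "1,foo,2.5"
assert to_csv(from_csv(line)) == line
```

The identity only holds when the schema and parsing options on both sides match, which is precisely the condition such an optimizer rule has to check.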
[jira] [Created] (SPARK-32966) Spark| PartitionBy is taking long time to process
Sujit Das created SPARK-32966: - Summary: Spark| PartitionBy is taking long time to process Key: SPARK-32966 URL: https://issues.apache.org/jira/browse/SPARK-32966 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.5 Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5 Reporter: Sujit Das 1. When I do a write without any partitioning, it takes 8 min: df2_merge.write.mode('overwrite').parquet(dest_path) 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took longer (more than 50 min before I force-terminated the EMR cluster). I observed that the partitions had been created and the data files were present, but in the EMR cluster the process was still showing as running, whereas the Spark history server showed no running or pending process. df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest) 3. I added a new conf, spark.sql.shuffle.partitions=3; it took 24 min: df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest) 4. I then disabled that conf and ran a plain write with partitioning. It took 30 min. df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest) The only conf common to all the scenarios above is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100. My goal is to reduce the time of writing with partitionBy. Is there anything I am missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
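The timings in the report are consistent with the file-explosion arithmetic behind partitionBy (a back-of-the-envelope sketch; the partition-value count below is an assumed example, not from the report): each write task can open one output file per distinct partition value it sees, so reducing the number of tasks with coalesce shrinks both the file count and the write time.

```python
def max_output_files(num_tasks, distinct_partition_values):
    """Upper bound on files written by df.write.partitionBy(col):
    every task may write one file for each partition value in its data."""
    return num_tasks * distinct_partition_values


# e.g. 100 initial partitions vs. coalesce(3), assuming 365 distinct
# posted_on values (hypothetical figure for illustration)
many = max_output_files(100, 365)  # up to 36500 small files
few = max_output_files(3, 365)     # up to 1095 files
```

Repartitioning by the partition column before the write is another common lever, at the cost of skew if one posted_on value dominates.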
[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type
[ https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200176#comment-17200176 ] Apache Spark commented on SPARK-32659: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29840 > Fix the data issue of inserted DPP on non-atomic type > - > > Key: SPARK-32659 > URL: https://issues.apache.org/jira/browse/SPARK-32659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > DPP has data issue when pruning on non-atomic type. for example: > {noformat} > spark.range(1000) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df1"); > spark.range(100) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") > spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = > struct(df2.k) AND df2.id < 2").show > {noformat} > It should return two records, but it returns empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32965) pyspark reading csv files with utf_16le encoding
Punit Shah created SPARK-32965: -- Summary: pyspark reading csv files with utf_16le encoding Key: SPARK-32965 URL: https://issues.apache.org/jira/browse/SPARK-32965 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0, 2.4.7 Reporter: Punit Shah If you have a file encoded in utf_16le or utf_16be and try to use spark.read.csv("", encoding="utf_16le") the dataframe isn't rendered properly if you use python decoding like: prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : x.decode("utf_16le").splitlines()) and then do spark.read.csv(prdd), then it works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
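The reported workaround can be exercised in plain Python with no Spark at all -- the `binaryFiles().values().flatMap(...)` step boils down to this decode-and-split:

```python
def decode_utf16le_lines(raw: bytes) -> list:
    """Decode UTF-16LE file bytes and split them into CSV lines,
    mirroring the lambda in the RDD-based workaround."""
    return raw.decode("utf_16le").splitlines()


payload = "id,name\n1,alpha\n2,beta".encode("utf_16le")
lines = decode_utf16le_lines(payload)  # ['id,name', '1,alpha', '2,beta']
```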
[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file
[ https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200163#comment-17200163 ] Punit Shah commented on SPARK-32956: That may work > Duplicate Columns in a csv file > --- > > Key: SPARK-32956 > URL: https://issues.apache.org/jira/browse/SPARK-32956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Priority: Major > > Imagine a csv file shaped like: > > Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price > 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728" > 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644" > = > Reading this with the header=True will result in a stacktrace. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
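One preprocessing approach consistent with the suggestion above (a sketch, not Spark behavior) is to rename duplicate header columns before handing the file to spark.read.csv, e.g. by suffixing repeats:

```python
def dedupe_header(names):
    """Rename duplicate column names by appending a numeric suffix,
    e.g. the second Sale_Amount becomes Sale_Amount_2."""
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 1
            out.append(n)
    return out


header = ["Id", "Product", "Sale_Amount", "Sale_Units",
          "Sale_Amount2", "Sale_Amount", "Sale_Price"]
unique = dedupe_header(header)
```

The deduplicated names could then be supplied as an explicit schema while reading the file with header=False and the first row skipped.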
[jira] [Comment Edited] (SPARK-32153) .m2 repository corruption happens
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200112#comment-17200112 ] Kousuke Saruta edited comment on SPARK-32153 at 9/22/20, 2:31 PM: -- [~shaneknapp] This issue seems to be happening again, especially for branch-2.4. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/ Could you help us? was (Author: sarutak): [~shaneknapp]This issue seems to happen again especially for branch-2.4. [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/ |https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/] Could you help us? > .m2 repository corruption happens > - > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8, 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. 
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption happens
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Affects Version/s: 2.4.8 > .m2 repository corruption happens > - > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8, 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16190) Worker registration failed: Duplicate worker ID
[ https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-16190. --- Fix Version/s: 3.0.0 Resolution: Duplicate This is fixed via SPARK-23191 . Please see [~Ngone51]'s comment, https://github.com/apache/spark/pull/29809#issuecomment-696483018 . > Worker registration failed: Duplicate worker ID > --- > > Key: SPARK-16190 > URL: https://issues.apache.org/jira/browse/SPARK-16190 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: Thomas Huang >Priority: Minor > Fix For: 3.0.0 > > Attachments: > spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, > spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, > spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, > spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out > > > Several worker crashed simultaneously due to this error: > Worker registration failed: Duplicate worker ID > This is the worker log on one of those crashed workers: > 16/06/24 16:28:53 INFO ExecutorRunner: Killing process! > 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor > app-20160624003013-0442/26 interrupted > 16/06/24 16:28:53 INFO ExecutorRunner: Killing process! > 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: > java.lang.UNIXProcess@31340137. This process will likely be orphaned. > 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: > java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned. 
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished > with state KILLED > 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished > with state KILLED > 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application > app-20160624003013-0442 > 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application > app-20160624003013-0442 removed, cleanupLocalDirs = true > 16/06/24 16:31:18 INFO Worker: Asked to launch executor > app-20160624162905-0469/14 for SparkStreamingLRScala > 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq > 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq > 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(mqq); users with > modify permissions: Set(mqq) > 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: > "/data/jdk1.7.0_60/bin/java" "-cp" > "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar" > "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" > "--hostname" "100.65.21.223" "--cores" "5" "--app-id" > "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581" > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077... 
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master > spark://100.65.21.199:7077 > 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application > directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work > 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with > the master, since there is an attempt scheduled already. > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077... > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with > the master, since there is an attempt scheduled already. > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with > the master, since there is an attempt scheduled already. > 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 > requested this worker to reconnect. > 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with >
[jira] [Reopened] (SPARK-32153) .m2 repository corruption happens
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reopened SPARK-32153: > .m2 repository corruption happens > - > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32153) .m2 repository corruption happens
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200112#comment-17200112 ] Kousuke Saruta commented on SPARK-32153: [~shaneknapp] This issue seems to happen again, especially for branch-2.4. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/ Could you help us? > .m2 repository corruption happens > - > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption happens
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Summary: .m2 repository corruption happens (was: .m2 repository corruption can happen on Jenkins-worker4) > .m2 repository corruption happens > - > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32956) Duplicate Columns in a csv file
[ https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated SPARK-32956: --- Component/s: (was: Spark Core) SQL > Duplicate Columns in a csv file > --- > > Key: SPARK-32956 > URL: https://issues.apache.org/jira/browse/SPARK-32956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Priority: Major > > Imagine a csv file shaped like: > > Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price > 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728" > 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644" > = > Reading this with the header=True will result in a stacktrace. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file
[ https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200092#comment-17200092 ] Chen Zhang commented on SPARK-32956: In SPARK-16896, if the CSV data has duplicate column headers, put the index as the suffix. In this case, _Sale_Amount_ is a duplicate column header. Original column header: {code:none} Id, Product, Sale_Amount, Sale_Units, Sale_Amount2, Sale_Amount, Sale_Price{code} Column header after adding index suffix: {code:none} Id, Product, Sale_Amount2, Sale_Units, Sale_Amount2, Sale_Amount5, Sale_Price{code} The _Sale_Amount2_ after adding the suffix is still the same as the other column header. Maybe we can add the suffix again when we find a new duplicate column header: {code:none} Id, Product, Sale_Amount22, Sale_Units, Sale_Amount24, Sale_Amount5, Sale_Price{code} > Duplicate Columns in a csv file > --- > > Key: SPARK-32956 > URL: https://issues.apache.org/jira/browse/SPARK-32956 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Priority: Major > > Imagine a csv file shaped like: > > Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price > 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728" > 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644" > = > Reading this with the header=True will result in a stacktrace. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
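Chen Zhang's suggestion — re-suffixing whenever a generated name still collides — can be sketched outside Spark as follows (illustrative Python, not the code path Spark's CSV header handling actually uses; the function name is made up):

```python
def dedup_headers(headers):
    """Append the column index to duplicated header names, re-suffixing
    until the generated name collides with nothing else (the step that
    SPARK-16896's one-shot renaming misses)."""
    counts = {}
    for h in headers:
        counts[h] = counts.get(h, 0) + 1
    used = set()
    result = []
    for i, h in enumerate(headers):
        name = h if counts[h] == 1 else h + str(i)
        # The generated name may itself collide, e.g. "Sale_Amount" + "2"
        # vs. a pre-existing "Sale_Amount2" column; keep suffixing.
        while name in used or (name != h and counts.get(name, 0) > 0):
            name += str(i)
        used.add(name)
        result.append(name)
    return result
```

On the header row from this ticket, the renamed duplicates end up unique while the already-unique columns keep their original names.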
[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
[ https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200079#comment-17200079 ] Apache Spark commented on SPARK-32757: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29839 > Physical InSubqueryExec should be consistent with logical InSubquery > > > Key: SPARK-32757 > URL: https://issues.apache.org/jira/browse/SPARK-32757 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type
[ https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200063#comment-17200063 ] Apache Spark commented on SPARK-32659: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29838 > Fix the data issue of inserted DPP on non-atomic type > - > > Key: SPARK-32659 > URL: https://issues.apache.org/jira/browse/SPARK-32659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > DPP has data issue when pruning on non-atomic type. for example: > {noformat} > spark.range(1000) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df1"); > spark.range(100) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") > spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = > struct(df2.k) AND df2.id < 2").show > {noformat} > It should return two records, but it returns empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31882) DAG-viz is not rendered correctly with pagination.
[ https://issues.apache.org/jira/browse/SPARK-31882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200035#comment-17200035 ] Apache Spark commented on SPARK-31882: -- User 'zhli1142015' has created a pull request for this issue: https://github.com/apache/spark/pull/29833 > DAG-viz is not rendered correctly with pagination. > -- > > Key: SPARK-31882 > URL: https://issues.apache.org/jira/browse/SPARK-31882 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Because DAG-viz for a job fetches link urls for each stage from the stage > table, rendering can fail with pagination. > You can reproduce this issue with the following operation. > {code:java} > sc.parallelize(1 to 10).map(value => (value > ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + > _).collect{code} > And then, visit the corresponding job page. > There are 5 stages so show <5 stages in the paged table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32938) Spark can not cast long value from Kafka
[ https://issues.apache.org/jira/browse/SPARK-32938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200028#comment-17200028 ] Vinod KC commented on SPARK-32938: -- [~maseiler], Can you please test with this example? {code:java} spark.readStream.format("kafka").option("kafka.bootstrap.servers", "127.0.0.1:9092").option("subscribe", "longtest").load().withColumn("key", conv(hex(col("key")), 16, 10).cast("bigint")).withColumn("value", conv(hex(col("value")), 16, 10).cast("bigint")).select("key", "value").writeStream.outputMode("update").format("console").start() {code} > Spark can not cast long value from Kafka > > > Key: SPARK-32938 > URL: https://issues.apache.org/jira/browse/SPARK-32938 > Project: Spark > Issue Type: Bug > Components: Java API, SQL, Structured Streaming >Affects Versions: 3.0.0 > Environment: Debian 10 (Buster), AMD64 > Spark 3.0.0 > Kafka 2.5.0 > spark-sql-kafka-0-10_2.12 >Reporter: Matthias Seiler >Priority: Major > > Spark seems to be unable to cast the key (or value) part from Kafka to a > _{color:#172b4d}long{color}_ value and throws > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS > BIGINT)' due to data type mismatch: cannot cast binary to bigint;;{code} > > {color:#172b4d}See this repo for further investigation:{color} > [https://github.com/maseiler/spark-kafka-casting-bug] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
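The `conv(hex(key), 16, 10)` workaround works because it reinterprets the raw key bytes as a base-16 number. Kafka's `LongSerializer` writes 8 big-endian bytes, so the same reinterpretation can be sketched in plain Python (illustrative only — for negative values the signed `struct` view below differs from the unsigned result of `conv`):

```python
import struct

def kafka_long_key(raw: bytes) -> int:
    """Decode the 8 big-endian bytes written by Kafka's LongSerializer
    into a signed 64-bit integer ('>q' = big-endian signed long)."""
    return struct.unpack(">q", raw)[0]
```

For example, a key produced as the long 1234 arrives as `b"\x00\x00\x00\x00\x00\x00\x04\xd2"`, which this decodes back to 1234 — the value Spark cannot currently obtain with a direct `CAST(key AS BIGINT)`.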
[jira] [Commented] (SPARK-32925) Support push-based shuffle in multiple deployment environments
[ https://issues.apache.org/jira/browse/SPARK-32925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200023#comment-17200023 ] qingwu.fu commented on SPARK-32925: --- Should we send data to the remote shuffle service, bypassing the sort and spill of data on the local node? Gathering data that belongs to the same partition onto the same node can take the place of the sort on the local node. > Support push-based shuffle in multiple deployment environments > -- > > Key: SPARK-32925 > URL: https://issues.apache.org/jira/browse/SPARK-32925 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > Create this ticket outside of SPARK-30602, since this is outside of the scope > of the immediate deliverables in that SPIP. Want to use this ticket to > discuss more about how to further improve push-based shuffle in different > environments. > The tasks created under SPARK-30602 would enable push-based shuffle on YARN > in a compute/storage colocated cluster. However, there are other deployment > environments that are getting more popular these days. We have seen 2 as we > discussed with other community members on the idea of push-based shuffle: > * Spark on K8S in a compute/storage colocated cluster. Because of the > limitation of concurrency of read/write of a mounted volume in K8S, multiple > executor pods on the same node in a K8S cluster cannot concurrently access > the same mounted disk volume. This creates some different requirements for > supporting external shuffle service as well as push-based shuffle. > * Spark on a compute/storage disaggregate cluster. Such a setup is more > typical in cloud environments, where the compute cluster has little/no local > storage, and the shuffle intermediate data needs to be stored in remote > disaggregate storage cluster. > Want to use this ticket to discuss ways to support push-based shuffle in > these different deployment environments. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1721#comment-1721 ] Apache Spark commented on SPARK-32463: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/29837 > Document Data Type inference rule in SQL reference > -- > > Key: SPARK-32463 > URL: https://issues.apache.org/jira/browse/SPARK-32463 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Document Data Type inference rule in SQL reference, under Data Types section. > Please see this PR https://github.com/apache/spark/pull/28896 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32463) Document Data Type inference rule in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32463: Assignee: Apache Spark > Document Data Type inference rule in SQL reference > -- > > Key: SPARK-32463 > URL: https://issues.apache.org/jira/browse/SPARK-32463 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > Document Data Type inference rule in SQL reference, under Data Types section. > Please see this PR https://github.com/apache/spark/pull/28896 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32463) Document Data Type inference rule in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32463: Assignee: (was: Apache Spark) > Document Data Type inference rule in SQL reference > -- > > Key: SPARK-32463 > URL: https://issues.apache.org/jira/browse/SPARK-32463 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Document Data Type inference rule in SQL reference, under Data Types section. > Please see this PR https://github.com/apache/spark/pull/28896 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1715#comment-1715 ] Apache Spark commented on SPARK-32964: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29836 > Pass all `streaming` module UTs in Scala 2.13 > - > > Key: SPARK-32964 > URL: https://issues.apache.org/jira/browse/SPARK-32964 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Spark Core >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Minor > > There is only one failed case of `streaming` module in Scala 2.13: > * `start with non-serializable DStream checkpoint ` in StreamingContextSuite > StackOverflowError is thrown here when SerializationDebugger#visit method is > called. > The error stack as follow: > {code:java} > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownExpected exception > java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownScalaTestFailureLocation: > org.apache.spark.streaming.StreamingContextSuite at > (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrown at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) > at org.scalatest.Assertions.intercept(Assertions.scala:756) at > org.scalatest.Assertions.intercept$(Assertions.scala:746) at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at > org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) > ...Caused by: java.lang.StackOverflowError at > 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at > sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) > at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at > scala.collection.AbstractIterable.foreach(Iterable.scala:920) at > scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) > at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at >
[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32964: Assignee: (was: Apache Spark) > Pass all `streaming` module UTs in Scala 2.13 > - > > Key: SPARK-32964 > URL: https://issues.apache.org/jira/browse/SPARK-32964 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Spark Core >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Minor > > There is only one failed case of `streaming` module in Scala 2.13: > * `start with non-serializable DStream checkpoint ` in StreamingContextSuite > StackOverflowError is thrown here when SerializationDebugger#visit method is > called. > The error stack as follow: > {code:java} > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownExpected exception > java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownScalaTestFailureLocation: > org.apache.spark.streaming.StreamingContextSuite at > (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrown at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) > at org.scalatest.Assertions.intercept(Assertions.scala:756) at > org.scalatest.Assertions.intercept$(Assertions.scala:746) at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at > org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) > ...Caused by: java.lang.StackOverflowError at > org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at > 
sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) > at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at > scala.collection.AbstractIterable.foreach(Iterable.scala:920) at > scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) > at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) > at > 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at >
[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32964: Assignee: Apache Spark > Pass all `streaming` module UTs in Scala 2.13 > - > > Key: SPARK-32964 > URL: https://issues.apache.org/jira/browse/SPARK-32964 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Spark Core >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > There is only one failed case of `streaming` module in Scala 2.13: > * `start with non-serializable DStream checkpoint ` in StreamingContextSuite > StackOverflowError is thrown here when SerializationDebugger#visit method is > called. > The error stack as follow: > {code:java} > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownExpected exception > java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrownScalaTestFailureLocation: > org.apache.spark.streaming.StreamingContextSuite at > (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: > Expected exception java.io.NotSerializableException to be thrown, but > java.lang.StackOverflowError was thrown at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) > at org.scalatest.Assertions.intercept(Assertions.scala:756) at > org.scalatest.Assertions.intercept$(Assertions.scala:746) at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at > org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) > ...Caused by: java.lang.StackOverflowError at > org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at > 
org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at > sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) > at > scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at > scala.collection.AbstractIterable.foreach(Iterable.scala:920) at > scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) > at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) > at > 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) > at > org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) > at >
[jira] [Updated] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-32964: - Description: There is only one failed case of `streaming` module in Scala 2.13: * `start with non-serializable DStream checkpoint ` in StreamingContextSuite StackOverflowError is thrown here when SerializationDebugger#visit method is called. The error stack as follow: {code:java} Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrownExpected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrownScalaTestFailureLocation: org.apache.spark.streaming.StreamingContextSuite at (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrown at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) at org.scalatest.Assertions.intercept(Assertions.scala:756) at org.scalatest.Assertions.intercept$(Assertions.scala:746) at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) ...Caused by: java.lang.StackOverflowError at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) at scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at scala.collection.AbstractIterable.foreach(Iterable.scala:920) at scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) {code} was: There is only one failed case of `streaming` module in Scala 2.13: * `start with non-serializable DStream checkpoint ` in StreamingContextSuite StackOverflowError is thrown here when SerializationDebugger#visit method is called. The error msg as follow: {code:java} Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was
[jira] [Updated] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-32964: - Description: There is only one failed case of `streaming` module in Scala 2.13: * `start with non-serializable DStream checkpoint ` in StreamingContextSuite StackOverflowError is thrown here when SerializationDebugger#visit method is called. The error msg as follow: {code:java} Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrownExpected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrownScalaTestFailureLocation: org.apache.spark.streaming.StreamingContextSuite at (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrown at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) at org.scalatest.Assertions.intercept(Assertions.scala:756) at org.scalatest.Assertions.intercept$(Assertions.scala:746) at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159) ...Caused by: java.lang.StackOverflowError at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38) at scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at scala.collection.AbstractIterable.foreach(Iterable.scala:920) at scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37) at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243) at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189) {code} was: There is only one failed case of `streaming` module in Scala 2.13: * `start with non-serializable
[jira] [Created] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13
Yang Jie created SPARK-32964: Summary: Pass all `streaming` module UTs in Scala 2.13 Key: SPARK-32964 URL: https://issues.apache.org/jira/browse/SPARK-32964 Project: Spark Issue Type: Sub-task Components: DStreams, Spark Core Affects Versions: 3.1.0 Reporter: Yang Jie There is only one failing test case in the `streaming` module in Scala 2.13: * `start with non-serializable DStream checkpoint ` in StreamingContextSuite A StackOverflowError is thrown when the SerializationDebugger#visit method is called. The error message is as follows: {code:java} Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrown ScalaTestFailureLocation: org.apache.spark.streaming.StreamingContextSuite at (StreamingContextSuite.scala:159) org.scalatest.exceptions.TestFailedException: Expected exception java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError was thrown {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
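The failure mode above — a serialization debugger recursing until the stack overflows — can be illustrated with a small language-agnostic sketch (Python here, purely for illustration; it is NOT Spark's actual Scala SerializationDebugger). An object-graph walker that records already-visited objects terminates on cyclic structures, while a naive walker recurses forever:

```python
def visit(obj, visited=None):
    # Illustrative sketch (NOT Spark's SerializationDebugger): a
    # recursive object-graph walker that records already-visited
    # objects so cyclic references terminate instead of overflowing
    # the call stack.
    if visited is None:
        visited = set()
    if id(obj) in visited:
        return visited  # cycle detected: stop instead of recursing forever
    visited.add(id(obj))
    for child in vars(obj).values():
        if hasattr(child, "__dict__"):
            visit(child, visited)
    return visited

class Node:
    def __init__(self):
        self.next = None

a, b = Node(), Node()
a.next, b.next = b, a  # cyclic reference between the two objects
seen = visit(a)        # terminates; without the visited set this would overflow
```

The reported StackOverflowError suggests the visited-tracking does not cover the new objects that Scala 2.13's `DefaultSerializationProxy` introduces into the graph — a hypothesis consistent with the stack trace, not a confirmed diagnosis.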
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199953#comment-17199953 ] Apache Spark commented on SPARK-32306: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29835 > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
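The result reported above is consistent with an approximate-percentile algorithm that returns an actual element of the input at the requested rank rather than interpolating between neighbors. A minimal Python sketch of that rank-selection behavior (an illustration only — Spark's `percentile_approx` uses a Greenwald-Khanna quantile summary internally):

```python
import math

def percentile_element(values, p):
    # Return the input element whose rank covers percentile p,
    # without interpolation. Under this rule the "median" of [5, 8]
    # is 5 rather than 6.5, matching the behavior reported above.
    s = sorted(values)
    rank = max(1, math.ceil(p * len(s)))
    return s[rank - 1]

print(percentile_element([5, 8], 0.5))  # -> 5
```

Whether this is "incorrect" or documented behavior is exactly the clarification question the linked PR addresses.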
[jira] [Assigned] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32306: Assignee: (was: Apache Spark) > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32306: Assignee: Apache Spark > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Assignee: Apache Spark >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199952#comment-17199952 ] Maxim Gekk commented on SPARK-32306: I opened PR https://github.com/apache/spark/pull/29835 with clarification. > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation
[ https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199933#comment-17199933 ] Apache Spark commented on SPARK-32963: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29834 > empty string should be consistent for schema name in SparkGetSchemasOperation > - > > Key: SPARK-32963 > URL: https://issues.apache.org/jira/browse/SPARK-32963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When the schema name is empty string, it is considered as ".*" and can match > all databases in the catalog. > But when it can not match the global temp view as it is not converted to ".*" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation
[ https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32963: Assignee: Apache Spark > empty string should be consistent for schema name in SparkGetSchemasOperation > - > > Key: SPARK-32963 > URL: https://issues.apache.org/jira/browse/SPARK-32963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > When the schema name is empty string, it is considered as ".*" and can match > all databases in the catalog. > But when it can not match the global temp view as it is not converted to ".*" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation
[ https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32963: Assignee: (was: Apache Spark) > empty string should be consistent for schema name in SparkGetSchemasOperation > - > > Key: SPARK-32963 > URL: https://issues.apache.org/jira/browse/SPARK-32963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When the schema name is empty string, it is considered as ".*" and can match > all databases in the catalog. > But when it can not match the global temp view as it is not converted to ".*" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation
[ https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199936#comment-17199936 ] Apache Spark commented on SPARK-32963: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29834 > empty string should be consistent for schema name in SparkGetSchemasOperation > - > > Key: SPARK-32963 > URL: https://issues.apache.org/jira/browse/SPARK-32963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When the schema name is empty string, it is considered as ".*" and can match > all databases in the catalog. > But when it can not match the global temp view as it is not converted to ".*" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation
Kent Yao created SPARK-32963: Summary: empty string should be consistent for schema name in SparkGetSchemasOperation Key: SPARK-32963 URL: https://issues.apache.org/jira/browse/SPARK-32963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Kent Yao When the schema name is an empty string, it is considered as ".*" and can match all databases in the catalog. But it cannot match the global temp view database, because the empty string is not converted to ".*" there.
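The inconsistency SPARK-32963 describes can be illustrated with a small, self-contained sketch. This is plain Python regex logic, not Spark's actual SparkGetSchemasOperation code; the function and variable names are hypothetical:

```python
import re

def convert_pattern(schema_pattern: str) -> str:
    # Hypothetical version of the widening the issue describes: an empty
    # schema pattern becomes ".*" so it matches every catalog database name.
    return ".*" if schema_pattern == "" else schema_pattern

databases = ["default", "sales"]
global_temp_db = "global_temp"

# Catalog databases are matched against the converted pattern...
pattern = convert_pattern("")
matched = [db for db in databases if re.fullmatch(pattern, db)]

# ...but (per the report) the global temp view database is compared against
# the raw, unconverted empty pattern, which never matches anything.
global_matched = re.fullmatch("", global_temp_db) is not None

print(matched)          # ['default', 'sales']
print(global_matched)   # False -- global_temp is missed
```

Applying the same conversion before both comparisons would make the two code paths consistent, which appears to be the behavior the issue asks for.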
[jira] [Updated] (SPARK-32962) Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-32962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Menashe updated SPARK-32962: - Priority: Trivial (was: Major) > Spark Streaming > --- > > Key: SPARK-32962 > URL: https://issues.apache.org/jira/browse/SPARK-32962 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.5 >Reporter: Amit Menashe >Priority: Trivial > > Hey there, > I'm using a Spark Streaming job integrated with Kafka (which manages its > offset commits in Kafka itself). > The problem is that when I have a failure, I want to repeat the work on the > offset ranges where something went wrong, so I catch the exception and do > NOT commit (with commitAsync) that range. > However, I notice the stream keeps proceeding (without any commit made). > Moreover, I later removed all the commitAsync calls and the stream still kept > proceeding! > I guess there might be some inner cache or something that lets the streaming > job keep consuming entries from Kafka. > > Could you please advise?
[jira] [Created] (SPARK-32962) Spark Streaming
Amit Menashe created SPARK-32962: Summary: Spark Streaming Key: SPARK-32962 URL: https://issues.apache.org/jira/browse/SPARK-32962 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.4.5 Reporter: Amit Menashe Hey there, I'm using a Spark Streaming job integrated with Kafka (which manages its offset commits in Kafka itself). The problem is that when I have a failure, I want to repeat the work on the offset ranges where something went wrong, so I catch the exception and do NOT commit (with commitAsync) that range. However, I notice the stream keeps proceeding (without any commit made). Moreover, I later removed all the commitAsync calls and the stream still kept proceeding! I guess there might be some inner cache or something that lets the streaming job keep consuming entries from Kafka. Could you please advise?
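The behavior the reporter sees is consistent with how commit-on-success offset management generally works: the consumer's in-memory read position advances with every batch regardless of commits, and committed offsets only matter when the application restarts and resumes from them. The sketch below illustrates that distinction with a generic fake consumer; all names are hypothetical, and this is not the spark-streaming-kafka API:

```python
# Generic "commit only on success" (at-least-once) sketch. A running loop
# keeps polling from its in-memory position; skipping the commit does not
# rewind the stream -- replay happens only on restart from `committed`.
class FakeConsumer:
    def __init__(self, records):
        self.records = records
        self.position = 0   # in-memory read position: advances on every poll
        self.committed = 0  # durable committed offset: advances only on commit

    def poll(self, n):
        batch = self.records[self.position:self.position + n]
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position

def process_batch(consumer, handler, n=2):
    batch = consumer.poll(n)
    try:
        handler(batch)
    except Exception:
        # Deliberately skip the commit on failure. Note the consumer's
        # position has already moved on, so the running stream proceeds.
        return False
    consumer.commit()
    return True

consumer = FakeConsumer([1, 2, 3, 4])

def failing_handler(batch):
    raise RuntimeError("simulated processing failure")

ok = process_batch(consumer, failing_handler)
print(ok, consumer.position, consumer.committed)  # False 2 0
```

Under this model, the stream "keeps proceeding" exactly as described, because uncommitted batches are only re-read after a restart that resumes from the last committed offset.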
[jira] [Commented] (SPARK-32886) '.../jobs/undefined' link from "Event Timeline" in jobs page
[ https://issues.apache.org/jira/browse/SPARK-32886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199898#comment-17199898 ] Apache Spark commented on SPARK-32886: -- User 'zhli1142015' has created a pull request for this issue: https://github.com/apache/spark/pull/29833 > '.../jobs/undefined' link from "Event Timeline" in jobs page > > > Key: SPARK-32886 > URL: https://issues.apache.org/jira/browse/SPARK-32886 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Minor > Fix For: 3.0.2, 3.1.0 > > Attachments: undefinedlink.JPG > > > In the event timeline view of the jobs page, clicking a job item should redirect you to > the corresponding job page. When there are too many jobs, some job items' links > redirect to a wrong URL like '.../jobs/undefined'
[jira] [Updated] (SPARK-32898) totalExecutorRunTimeMs is too big
[ https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32898: -- Fix Version/s: 2.4.8 > totalExecutorRunTimeMs is too big > - > > Key: SPARK-32898 > URL: https://issues.apache.org/jira/browse/SPARK-32898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Linhong Liu >Assignee: wuyi >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > This might be caused by incorrectly calculating executorRunTimeMs in > Executor.scala: the function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can > be called when taskStartTimeNs has not been set yet (it is still 0). > As of now in the master branch, here is the problematic code: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] > > An exception is thrown before this line, and the catch branch still updates > the metric. > However, the query shows as SUCCESSful. Maybe this task is speculative. Not > sure. > > submissionTime in LiveExecutionData may have a similar problem: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
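The failure mode described above — computing an elapsed time against a start timestamp that was never recorded (still 0) — produces an "elapsed" value equal to the absolute clock reading, which is enormous. A minimal sketch of the bug and an illustrative guard follows; the function name is hypothetical and this is not the actual Executor.scala code:

```python
import time

def executor_run_time_ms(task_start_ns: int, now_ns: int) -> int:
    # If the task failed before its start time was recorded, task_start_ns
    # is still 0, and now_ns - 0 is the absolute monotonic clock value --
    # a huge, meaningless "runtime". Guard against the unset case.
    if task_start_ns == 0:
        return 0
    return (now_ns - task_start_ns) // 1_000_000

now = time.monotonic_ns()
print(executor_run_time_ms(0, now))                 # guarded: 0
print(executor_run_time_ms(now - 5_000_000, now))   # 5 (ms)
```

Without the guard, the first call would report the full clock value in milliseconds, matching the "too big" totalExecutorRunTimeMs symptom in the report.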
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199846#comment-17199846 ] Sean Malory commented on SPARK-32306: - [~maxgekk]; thanks for the definition. Can we please update the docs to state that this is how it's being calculated? > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm.
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199844#comment-17199844 ] Sean Malory commented on SPARK-32306: - Exactly; you should get the median, which is defined, almost universally, as the average of the middle two numbers if there are an even number of elements in the list. As you've hinted at, it doesn't really matter. If you decide that the percentile should always give you the lower of the two numbers (as it appears to do), that's fine, but I think it should be documented as such. The way this actually came about was me creating a median function and then testing that the function was doing the right thing by comparing it with the `pandas` equivalent: {code:python} import numpy as np import pandas as pd import pyspark.sql.functions as psf median = psf.expr('percentile_approx(val, 0.5, 2147483647)') xs = np.random.rand(10) ys = np.random.rand(10) data = [('foo', float(x)) for x in xs] + [('bar', float(y)) for y in ys] sparkdf = spark.createDataFrame(data, ['name', 'val']) spark_meds = sparkdf.groupBy('name').agg(median.alias('median')) pddf = pd.DataFrame(data, columns=['name', 'val']) pd_meds = pddf.groupby('name')['val'].median() {code} > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. 
Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm.
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199839#comment-17199839 ] Maxim Gekk commented on SPARK-32306: The function returns an element of the input sequence, see https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm.
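The nearest-rank method Maxim points to always returns a member of the input sequence rather than interpolating, which is why the median of {5, 8} comes out as 5 instead of 6.5. A minimal sketch of that method (an illustration of nearest-rank selection, not Spark's actual approximate-percentile implementation):

```python
import math

def nearest_rank_percentile(values, p):
    # Nearest-rank method: return the element at rank ceil(p * n) of the
    # sorted input, so the result is always one of the input values.
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

print(nearest_rank_percentile([5, 8], 0.5))  # 5 -- not the interpolated 6.5
```

This matches the pyspark result in the report and differs from pandas' `median`, which interpolates between the two middle values for even-length input — exactly the discrepancy Sean's comparison script surfaced.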
[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big
[ https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199840#comment-17199840 ] Apache Spark commented on SPARK-32898: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29832 > totalExecutorRunTimeMs is too big > - > > Key: SPARK-32898 > URL: https://issues.apache.org/jira/browse/SPARK-32898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Linhong Liu >Assignee: wuyi >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > This might be because of incorrectly calculating executorRunTimeMs in > Executor.scala > The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can > be called when taskStartTimeNs is not set yet (it is 0). > As of now in master branch, here is the problematic code: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] > > There is a throw exception before this line. The catch branch still updates > the metric. > However the query shows as SUCCESSful. Maybe this task is speculative. Not > sure. > > submissionTime in LiveExecutionData may also have similar problem. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]