[jira] [Updated] (SPARK-22357) SparkContext.binaryFiles ignores minPartitions parameter
[ https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22357: Labels: behavior-changes (was: )

> SparkContext.binaryFiles ignores minPartitions parameter
> ---
>
> Key: SPARK-22357
> URL: https://issues.apache.org/jira/browse/SPARK-22357
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.2, 2.2.0
> Reporter: Weichen Xu
> Assignee: Bo Meng
> Priority: Major
> Labels: behavior-changes
> Fix For: 2.4.0
>
> This is a bug in binaryFiles: even though we give it the number of partitions, binaryFiles ignores it.
> The bug was introduced in Spark 2.1 (relative to Spark 2.0): in the file PortableDataStream.scala, the argument "minPartitions" is no longer used (with the push to master on 11/7/6):
> {code}
> /**
>  * Allow minPartitions set by end-user in order to keep compatibility with old
>  * Hadoop API, which is set through setMaxSplitSize
>  */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
>   val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>   val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is very smart, but it ignores what the user passes in and uses the data size instead, which is a breaking change in some sense.
> In our specific case this was a problem, because we initially read in just the file names, and only afterwards, when reading in the images themselves, does the dataframe become very large – and in this case the new code does not handle the partitioning very well.
> I'm not sure whether it can be easily fixed, because I don't understand the full context of the change in Spark (but at the very least the unused parameter should be removed to avoid confusion).

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
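A minimal sketch of the reported behavior, assuming a hypothetical input path; this is not from the ticket itself. On the affected versions the requested minPartitions is accepted but has no effect on the resulting partition count:

{code}
import org.apache.spark.SparkContext

object BinaryFilesPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate()

    // Explicitly request 128 partitions for a directory of binary files.
    val rdd = sc.binaryFiles("/data/images", minPartitions = 128)

    // On Spark 2.1+ this prints a partition count derived purely from the
    // total data size (see setMinPartitions above), not the requested 128.
    println(s"partitions = ${rdd.getNumPartitions}")
  }
}
{code}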
[jira] [Updated] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
[ https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25312: Labels: starter (was: ) > Add description for the conf spark.network.crypto.keyFactoryIterations > -- > > Key: SPARK-25312 > URL: https://issues.apache.org/jira/browse/SPARK-25312 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Affects Versions: 2.3.2 >Reporter: Xiao Li >Priority: Major > Labels: starter > > https://github.com/apache/spark/pull/22195 fixed a typo in the undocumented conf `spark.network.crypto.keyFactoryIterations`. We should document it as we did for the other spark.network.crypto.* confs in > https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
Xiao Li created SPARK-25312: --- Summary: Add description for the conf spark.network.crypto.keyFactoryIterations Key: SPARK-25312 URL: https://issues.apache.org/jira/browse/SPARK-25312 Project: Spark Issue Type: Documentation Components: Documentation, Spark Core Affects Versions: 2.3.2 Reporter: Xiao Li https://github.com/apache/spark/pull/22195 fixed a typo in the undocumented conf `spark.network.crypto.keyFactoryIterations`. We should document it as we did for the other spark.network.crypto.* confs in https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
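For context, a hedged sketch of how this conf is set alongside the already-documented spark.network.crypto.* options; the value shown is illustrative, not a recommendation from the ticket:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Enable AES-based RPC encryption (documented today).
  .set("spark.network.crypto.enabled", "true")
  // The conf this ticket asks to document: the number of key-factory
  // iterations used when deriving the encryption key.
  .set("spark.network.crypto.keyFactoryIterations", "1024")
{code}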
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601786#comment-16601786 ] Peter Toth commented on SPARK-25150: [~EeveeB], sorry, I have just noticed that you might have started working on a patch. I think I came to the same conclusion as you and submitted a PR, but I'm quite new to Spark so any comments are welcome. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
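A hedged, self-contained sketch (not the attached zombie-analysis.py; all names are hypothetical) of the general shape that triggers this: B1 and B2 both derive from the same B, so their columns can resolve to the same underlying attributes and the join condition degenerates into a trivial one:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-25150-sketch").getOrCreate()

val a  = spark.range(10).toDF("key")
val b  = spark.range(10).toDF("key")
val b1 = b.filter(b("key") < 5)   // first DataFrame derived from b
val b2 = b.filter(b("key") >= 5)  // second DataFrame derived from b

// Because b1("key") and b2("key") trace back to the same attribute of b, the
// second join condition can be seen as comparing a column with itself, which
// Spark may report as a missing/trivial join condition.
val joined = a
  .join(b1, a("key") === b1("key"))
  .join(b2, b1("key") === b2("key"))
joined.explain()
{code}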
[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601777#comment-16601777 ] Apache Spark commented on SPARK-25044: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22319

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 2.4.0
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Major
> Fix For: 2.4.0
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can roughly see what's going on by reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
>   // assume(!ClosureCleanerSuite2.supportsLMFs)
>   // This test won't test what it intends to in 2.12, as lambda metafactory closures
>   // have arg types that are not primitive, but Object
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>     .withColumn("c", udf1($"a", lit(null)))
>   val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
>   comparePlans(df.logicalPlan, plan)
>   checkAnswer(
>     df,
>     Seq(
>       Row(0, 10, null),
>       Row(1, 12, null),
>       Row(2, 14, null)))
> }{code}
>
> It seems that the closure that is fed in as a UDF changes behavior, in a way that primitive-type arguments are handled differently. For example an Int argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are understood, but not exactly sure of the cause yet.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
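A hedged, standalone illustration (not from the ticket) of why an Int argument can surface as 0 rather than null: in Scala, unboxing a boxed null to a primitive yields the type's zero value, which is the kind of behavior that appears once the closure's parameter types erase to Object:

{code}
// Plain Scala, no Spark required.
val boxed: Any = null
val asInt: Int = boxed.asInstanceOf[Int] // unboxing null yields the zero value
println(asInt)                           // prints 0, not an error
{code}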
[jira] [Updated] (SPARK-25301) When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25301: - Description:

When a Hive view uses a UDF from a non-default database, the Spark analyser throws an AnalysisException.

Steps to simulate this issue:

In Hive
1) CREATE DATABASE d100;
2) create function d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100
3) create view d100.v100 as select *d100.udf100*(name) from default.emp; // Note: table default.emp has two columns 'name', 'address'
4) select * from d100.v100; // query on view d100.v100 gives the correct result

In Spark
1) spark.sql("select * from d100.v100").show
throws
```
org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. This function is neither a registered temporary function nor a permanent function registered in the database '*default*'
```
This is because, while parsing the SQL statement of the view 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to split the database name from the UDF name, so the Spark function registry tries to load the UDF 'd100.udf100' from the 'default' database.

was:

When a Hive view uses a UDF from a non-default database, the Spark analyser throws an AnalysisException.

Steps to simulate this issue:

In Hive
1) CREATE DATABASE d100;
2) ADD JAR /usr/udf/masking.jar // masking.jar has a custom udf class 'com.uzx.udf.Masking'
3) create function d100.udf100 as "com.uzx.udf.Masking"; // Note: udf100 is created in d100
4) create view d100.v100 as select *d100.udf100*(name) from default.emp; // Note: table default.emp has two columns 'name', 'address'
5) select * from d100.v100; // query on view d100.v100 gives the correct result

In Spark
1) spark.sql("select * from d100.v100").show
throws
```
org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. This function is neither a registered temporary function nor a permanent function registered in the database '*default*'
```
This is because, while parsing the SQL statement of the view 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to split the database name from the UDF name, so the Spark function registry tries to load the UDF 'd100.udf100' from the 'default' database.

> When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException
> 
>
> Key: SPARK-25301
> URL: https://issues.apache.org/jira/browse/SPARK-25301
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Vinod KC
> Priority: Minor
>
> When a Hive view uses a UDF from a non-default database, the Spark analyser throws an AnalysisException.
> Steps to simulate this issue
> -
> In Hive
> 1) CREATE DATABASE d100;
> 2) create function d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100
> 3) create view d100.v100 as select *d100.udf100*(name) from default.emp; // Note: table default.emp has two columns 'name', 'address'
> 4) select * from d100.v100; // query on view d100.v100 gives the correct result
> In Spark
> -
> 1) spark.sql("select * from d100.v100").show
> throws
> ```
> org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. This function is neither a registered temporary function nor a permanent function registered in the database '*default*'
> ```
> This is because, while parsing the SQL statement of the view 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to split the database name from the UDF name, so the Spark function registry tries to load the UDF 'd100.udf100' from the 'default' database.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking
[ https://issues.apache.org/jira/browse/SPARK-25311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601751#comment-16601751 ] Xiaochen Ouyang commented on SPARK-25311: - An IPv6/IPv4 regular expression could be used to solve this problem, but `checkHost` is invoked very frequently, so there would be a certain performance cost. Does anyone have a recommended solution?

> `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking
> ---
>
> Key: SPARK-25311
> URL: https://issues.apache.org/jira/browse/SPARK-25311
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.1, 2.2.2
> Reporter: Xiaochen Ouyang
> Priority: Major
>
> IPv4 addresses can pass the following check:
> {code:java}
> def checkHost(host: String, message: String = "") {
>   assert(host.indexOf(':') == -1, message)
> }
> {code}
> But IPv6 addresses fail it.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
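A hedged sketch of one possible direction (not a committed fix): treat a bracketed IPv6 literal as a valid bare host while still rejecting host:port strings, avoiding a full regular-expression match on every call:

{code}
// Assumption: IPv6 hosts arrive in URI-style bracket form, e.g. "[2001:db8::1]".
def checkHost(host: String, message: String = ""): Unit = {
  val isBracketedIpv6 = host.startsWith("[") && host.endsWith("]")
  // A bare IPv4 host or hostname contains no ':'; a bracketed IPv6 literal is also accepted.
  assert(isBracketedIpv6 || host.indexOf(':') == -1, message)
}
{code}

This keeps the check to a couple of constant-time string operations, which matters given how frequently checkHost is invoked.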
[jira] [Created] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking
Xiaochen Ouyang created SPARK-25311: --- Summary: `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking Key: SPARK-25311 URL: https://issues.apache.org/jira/browse/SPARK-25311 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.2, 2.2.1 Reporter: Xiaochen Ouyang

IPv4 addresses can pass the following check:
{code:java}
def checkHost(host: String, message: String = "") {
  assert(host.indexOf(':') == -1, message)
}
{code}
But IPv6 addresses fail it.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25304) enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25304: - Assignee: Darcy Shen > enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12 > -- > > Key: SPARK-25304 > URL: https://issues.apache.org/jira/browse/SPARK-25304 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Assignee: Darcy Shen >Priority: Minor > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25304) enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25304. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22308 [https://github.com/apache/spark/pull/22308] > enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12 > -- > > Key: SPARK-25304 > URL: https://issues.apache.org/jira/browse/SPARK-25304 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Assignee: Darcy Shen >Priority: Minor > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25304) enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25304: -- Priority: Minor (was: Major) > enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12 > -- > > Key: SPARK-25304 > URL: https://issues.apache.org/jira/browse/SPARK-25304 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Priority: Minor > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23253) Only write shuffle temporary index file when there is not an existing one
[ https://issues.apache.org/jira/browse/SPARK-23253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601747#comment-16601747 ] Wenchen Fan commented on SPARK-23253: - Hi [~irashid], thanks for providing these references and sorry for the false alert! I was too hasty when searching the commit history and mistakenly got to this ticket. You are right, https://github.com/apache/spark/pull/9610 is the one that needs to be (partially) reverted to make my test pass. According to the discussion in https://github.com/apache/spark/pull/9214 , it seems we already knew about the problem of non-deterministic output but decided to leave it and stick with "first write wins", as it's too hard to fix. I think https://github.com/apache/spark/pull/6648 is the right fix. Since it's not possible to finish https://github.com/apache/spark/pull/6648 before Spark 2.4, I'll reference it in a code comment and just fail the job if non-deterministic shuffle writing is detected. In the next release, I can help with https://github.com/apache/spark/pull/6648 to really fix the repartition bug. Thanks! > Only write shuffle temporary index file when there is not an existing one > - > > Key: SPARK-23253 > URL: https://issues.apache.org/jira/browse/SPARK-23253 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.2.1 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 2.4.0 > > > The temporary shuffle index file is used for atomically creating the shuffle index file; it is not needed when the index file already exists because another attempt of the same task has already written it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25150: Assignee: (was: Apache Spark) > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601734#comment-16601734 ] Apache Spark commented on SPARK-25150: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22318 > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25150: Assignee: Apache Spark > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25135) insert datasource table may be all null when selecting from a view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601728#comment-16601728 ] Yuming Wang commented on SPARK-25135: - [https://github.com/apache/spark/pull/22311] [https://github.com/apache/spark/pull/22287] We are trying to fix it.

> insert datasource table may be all null when selecting from a view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 2.3.1
> Reporter: Yuming Wang
> Priority: Blocker
> Labels: Parquet, correctness
>
> This happens on parquet. How to reproduce with parquet:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is the same scenario with orc:
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +----+----+
> |COL1|COL2|
> +----+----+
> | 15| 15|
> | 16| 16|
> | 17| 17|
> ...
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25265) Fix memory leak in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-25265: --- Summary: Fix memory leak in Barrier Execution Mode (was: Fix memory leak vulnerability in Barrier Execution Mode) > Fix memory leak in Barrier Execution Mode > - > > Key: SPARK-25265 > URL: https://issues.apache.org/jira/browse/SPARK-25265 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Critical > > BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked > in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. > Once a TimerTask is scheduled, the reference to it is not released until > `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
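A hedged sketch of the leak and its remedy as described above (names are illustrative, not the actual patch): java.util.Timer keeps a cancelled TimerTask referenced in its internal queue until purge() runs or the task's scheduled time arrives, so cancel() alone is not enough:

{code}
import java.util.{Timer, TimerTask}

val timer = new Timer("barrier epoch increment timer")

def cancelTimerTask(task: TimerTask): Unit = {
  task.cancel()
  // Without this call, cancelled tasks stay referenced from the Timer's
  // internal queue and accumulate over many barrier() calls.
  timer.purge()
}
{code}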
[jira] [Resolved] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-25265. Resolution: Duplicate Thanks for the notification. It may have been accidentally duplicated. I'll close this one. > Fix memory leak vulnerability in Barrier Execution Mode > --- > > Key: SPARK-25265 > URL: https://issues.apache.org/jira/browse/SPARK-25265 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Critical > > BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked > in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. > Once a TimerTask is scheduled, the reference to it is not released until > `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601643#comment-16601643 ] Dongjoon Hyun commented on SPARK-25176: - Sorry, [~m.pryahin]. I overlooked the example in [4]. I deleted my previous comment.

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.2, 2.3.1
> Reporter: Mikhail Pryakhin
> Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo serializer contains an issue [1,2] which results in ClassCastExceptions being thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25176: -- Comment: was deleted (was: [~m.pryahin]. There is not much information for this. Since this is a general suggestion for an upgrade, let's close this as a duplicate of SPARK-23131. SPARK-23131 has a PR for you.)

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.2, 2.3.1
> Reporter: Mikhail Pryakhin
> Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo serializer contains an issue [1,2] which results in ClassCastExceptions being thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601642#comment-16601642 ] Dongjoon Hyun commented on SPARK-25176: - [~m.pryahin]. There is not much information for this. Since this is a general suggestion for an upgrade, let's close this as a duplicate of SPARK-23131. SPARK-23131 has a PR for you.

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.2, 2.3.1
> Reporter: Mikhail Pryakhin
> Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo serializer contains an issue [1,2] which results in ClassCastExceptions being thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601640#comment-16601640 ] Dongjoon Hyun commented on SPARK-20389: - Hi, [~georg.kf.hei...@gmail.com] and [~tashoyan]. SPARK-23131 is trying to resolve this issue via https://github.com/apache/spark/pull/22179 . Could you test the patch in your environment in order to resolve this issue together?

> Upgrade kryo to fix NegativeArraySizeException
> --
>
> Key: SPARK-20389
> URL: https://issues.apache.org/jira/browse/SPARK-20389
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Submit
> Affects Versions: 2.1.0, 2.2.1
> Environment: Linux, Centos7, jdk8
> Reporter: Georg Heiler
> Priority: Major
>
> I am experiencing an issue with Kryo when writing parquet files. Similar to https://github.com/broadinstitute/gatk/issues/1524, a NegativeArraySizeException occurs. Apparently this is fixed in a current Kryo version. Spark is still using the very old 3.3 Kryo.
> Can you please upgrade to a fixed Kryo version?

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601260#comment-16601260 ] Apache Spark commented on SPARK-25310: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/22317 > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Invoking {{ArraysOverlap}} function with non-nullable array type throws the > following error in the code generation phase. > {code:java} > Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) > at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25310: Assignee: Apache Spark > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Major > > Invoking {{ArraysOverlap}} function with non-nullable array type throws the > following error in the code generation phase. > {code:java} > Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) > at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25310: Assignee: (was: Apache Spark) > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Invoking {{ArraysOverlap}} function with non-nullable array type throws the > following error in the code generation phase. > {code:java} > Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) > at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Description: Invoking the {{ArraysOverlap}} function with a non-nullable array type throws the following error in the code generation phase.

{code:java}
Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
{code}

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Kazuaki Ishizaki
> Priority: Major
>
> Invoking the {{ArraysOverlap}} function with a non-nullable array type throws the following error in the code generation phase.
> {code:java}
> Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
> java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
> java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
> at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
> at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
> at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
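A hedged reproduction sketch based on the arrays_overlap([1,2,3], [4,5,3]) shown in the error message; both array() literals are non-nullable, which is the case that fails:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("SPARK-25310-repro").getOrCreate()

// Non-nullable array literals trigger the failing codegen path.
spark.range(1)
  .select(arrays_overlap(array(lit(1), lit(2), lit(3)), array(lit(4), lit(5), lit(3))))
  .show()
{code}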
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Summary: ArraysOverlap may throw a CompileException (was: ArraysOverlap throws an Exception) > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25310) ArraysOverlap throws an Exception
Kazuaki Ishizaki created SPARK-25310: Summary: ArraysOverlap throws an Exception Key: SPARK-25310 URL: https://issues.apache.org/jira/browse/SPARK-25310 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25309) scikit-learn-like Auto Pipeline Parallelization in Spark
Ravi created SPARK-25309: Summary: scikit-learn-like Auto Pipeline Parallelization in Spark Key: SPARK-25309 URL: https://issues.apache.org/jira/browse/SPARK-25309 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.1 Reporter: Ravi

SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. However, instead of setting the parallelism parameter in the CrossValidator, it would be good to have something like n_jobs=-1 (as in scikit-learn), where the Pipeline DAG could be automatically parallelized and scheduled based on the resources allocated to the Spark session, instead of having the user pick an integer value for this parameter.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
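For reference, a hedged sketch of the current, explicit API that this ticket proposes to automate; the estimator and grid are illustrative, and the user must still hand-pick the integer parallelism value:

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  // The knob added by SPARK-19357/SPARK-21911; the proposal is to infer this
  // from the session's allocated resources instead of a hand-picked constant.
  .setParallelism(4)
{code}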
[jira] [Commented] (SPARK-25048) Pivoting by multiple columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-25048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601210#comment-16601210 ] Apache Spark commented on SPARK-25048: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22316

> Pivoting by multiple columns in Scala/Java
> --
>
> Key: SPARK-25048
> URL: https://issues.apache.org/jira/browse/SPARK-25048
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maxim Gekk
> Priority: Minor
>
> Need to change or extend the existing API to make pivoting by multiple columns possible. Users should be able to use many columns and values like in the example:
> {code:scala}
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"), Seq(
>     struct(lit("dotnet"), lit("Experts")),
>     struct(lit("java"), lit("Dummies")))
>   ).agg(sum($"sales.earnings"))
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR
[ https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-25007. -- Resolution: Fixed Fix Version/s: 2.4.0 > Add array_intersect / array_except /array_union / array_shuffle to SparkR > - > > Key: SPARK-25007 > URL: https://issues.apache.org/jira/browse/SPARK-25007 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.4.0 > > > Add R version of > * array_intersect -SPARK-23913- > * array_except -SPARK-23915- > * array_union -SPARK-23914- > * array_shuffle -SPARK-23928- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR
[ https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-25007: Assignee: Huaxin Gao > Add array_intersect / array_except /array_union / array_shuffle to SparkR > - > > Key: SPARK-25007 > URL: https://issues.apache.org/jira/browse/SPARK-25007 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add R version of > * array_intersect -SPARK-23913- > * array_except -SPARK-23915- > * array_union -SPARK-23914- > * array_shuffle -SPARK-23928- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org