[jira] [Created] (SPARK-20121) simplify NullPropagation with NullIntolerant
Wenchen Fan created SPARK-20121:
-----------------------------------

Summary: simplify NullPropagation with NullIntolerant
Key: SPARK-20121
URL: https://issues.apache.org/jira/browse/SPARK-20121
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20120:
------------------------------------

Assignee: (was: Apache Spark)

> spark-sql CLI support silent mode
> ---------------------------------
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
>
> It is similar to Hive's silent mode: just show the query result. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Commented] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944549#comment-15944549 ]

Apache Spark commented on SPARK-20120:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17449

> spark-sql CLI support silent mode
> ---------------------------------
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
>
> It is similar to Hive's silent mode: just show the query result. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Assigned] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20120:
------------------------------------

Assignee: Apache Spark

> spark-sql CLI support silent mode
> ---------------------------------
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
>
> It is similar to Hive's silent mode: just show the query result. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Created] (SPARK-20120) spark-sql CLI support silent mode
Yuming Wang created SPARK-20120:
-----------------------------------

Summary: spark-sql CLI support silent mode
Key: SPARK-20120
URL: https://issues.apache.org/jira/browse/SPARK-20120
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.0
Reporter: Yuming Wang

It is similar to Hive's silent mode: just show the query result. See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Assigned] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
[ https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20119:
------------------------------------

Assignee: Apache Spark (was: Xiao Li)

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> ---------------------------------------------------------------------------
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Apache Spark
>
> Failed in
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/
[jira] [Commented] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
[ https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944535#comment-15944535 ]

Apache Spark commented on SPARK-20119:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17448

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> ---------------------------------------------------------------------------
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Xiao Li
>
> Failed in
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/
[jira] [Assigned] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
[ https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20119:
------------------------------------

Assignee: Xiao Li (was: Apache Spark)

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> ---------------------------------------------------------------------------
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Xiao Li
>
> Failed in
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/
[jira] [Created] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
Xiao Li created SPARK-20119:
-------------------------------

Summary: Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
Key: SPARK-20119
URL: https://issues.apache.org/jira/browse/SPARK-20119
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li

Failed in
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944506#comment-15944506 ]

Sean Owen commented on SPARK-19476:
-----------------------------------

Why do that in a thread instead of doing that same work synchronously? It sounds like you even need to make it synchronous externally.

> Running threads in Spark DataFrame foreachPartition() causes NullPointerException
> ---------------------------------------------------------------------------------
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: Gal Topper
> Priority: Minor
>
> First reported on [Stack overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me except for when the underlying iterator is TungstenAggregationIterator. Here is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
>     val f = Future(iterator.toVector)
>     Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - TungstenAggregationIterator uses a ThreadLocal variable that returns null when called from a thread other than the original thread that got the iterator from Spark. From examining the code, this does not appear to differ between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not documented, as far as I'm aware.
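The ThreadLocal behaviour described in the SPARK-19476 report can be shown without Spark. The following is a minimal standalone sketch, not Spark's code: `ThreadLocalSketch`, `taskContext`, `readFromOtherThread`, and `materializeThenProcess` are hypothetical names chosen for illustration. It assumes only that some per-task state lives in a `ThreadLocal`, as the reporter believes is the case for TungstenAggregationIterator.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadLocalSketch {
  // Hypothetical stand-in for per-task state kept in a ThreadLocal.
  private val taskContext = new ThreadLocal[String]

  // Sets the value on the calling thread, then reads it from a pool thread.
  // The pool thread never set a value, so get() returns null there.
  def readFromOtherThread(): Option[String] = {
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      taskContext.set("task-0")
      Await.result(Future(Option(taskContext.get())), Duration.Inf)
    } finally {
      taskContext.remove()
      pool.shutdown()
    }
  }

  // One possible workaround (an assumption, not an official recommendation):
  // drain the iterator on the calling thread, then hand the materialized
  // rows to another thread for processing.
  def materializeThenProcess[A, B](iterator: Iterator[A])(f: Vector[A] => B): B = {
    val rows = iterator.toVector // consumed on the calling thread
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future(f(rows)), Duration.Inf)
    finally pool.shutdown()
  }
}
```

Inside `foreachPartition`, the second helper would correspond to calling `iterator.toVector` before creating the `Future` rather than inside it, at the cost of holding the whole partition in memory.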
[jira] [Closed] (SPARK-20118) spark2.1 support numeric datatype
[ https://issues.apache.org/jira/browse/SPARK-20118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

QQShu1 closed SPARK-20118.
--------------------------

> spark2.1 support numeric datatype
> ---------------------------------
>
> Key: SPARK-20118
> URL: https://issues.apache.org/jira/browse/SPARK-20118
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: QQShu1
>
> Spark 2.1 does not currently support numeric.
[jira] [Resolved] (SPARK-20118) spark2.1 support numeric datatype
[ https://issues.apache.org/jira/browse/SPARK-20118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20118.
-------------------------------

Resolution: Invalid

> spark2.1 support numeric datatype
> ---------------------------------
>
> Key: SPARK-20118
> URL: https://issues.apache.org/jira/browse/SPARK-20118
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: QQShu1
>
> Spark 2.1 does not currently support numeric.
[jira] [Commented] (SPARK-19963) create view from select Fails when nullif() is used
[ https://issues.apache.org/jira/browse/SPARK-19963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944461#comment-15944461 ]

dharani_sugumar commented on SPARK-19963:
-----------------------------------------

@Jay Danielsen: I'm looking into this.

> create view from select Fails when nullif() is used
> ---------------------------------------------------
>
> Key: SPARK-19963
> URL: https://issues.apache.org/jira/browse/SPARK-19963
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Jay Danielsen
> Priority: Minor
>
> Test case: any valid query using nullif.
> SELECT nullif(mycol,0) from mytable;
> CREATE VIEW fails when nullif is used in the select.
> CREATE VIEW my_view as
> SELECT nullif(mycol,0) from mytable;
> Error: java.lang.RuntimeException: Failed to analyze the canonicalized SQL:
> ...
> I can refactor with a CASE statement and create the view successfully.
> CREATE VIEW my_view as
> SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END mycol from mytable;
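The CASE workaround in the SPARK-19963 report relies on `nullif(mycol, 0)` and `CASE WHEN mycol = 0 THEN NULL ELSE mycol END` producing the same value for every input. The sketch below models that equivalence in plain Scala, with `None` standing in for SQL NULL. `NullIfSketch`, `nullIf`, and `caseWhen` are hypothetical names, and this models standard SQL semantics rather than Spark's implementation.

```scala
object NullIfSketch {
  // nullif(value, other): NULL when value = other, otherwise value.
  // A NULL input stays NULL because NULL = other is never true.
  def nullIf[A](value: Option[A], other: A): Option[A] =
    value.filterNot(_ == other)

  // CASE WHEN mycol = 0 THEN NULL ELSE mycol END
  def caseWhen(mycol: Option[Int]): Option[Int] = mycol match {
    case Some(0) => None  // WHEN branch: comparison is true
    case other   => other // ELSE branch, also taken for NULL input
  }
}
```

Both functions agree on zero, non-zero, and NULL inputs, which is why the rewritten view returns the same rows as the original query.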
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mitesh updated SPARK-20112:
---------------------------

Description:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables. And we partition/sort them all beforehand so it's always sort-merge-joins or broadcast joins (with small tables).

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error] SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
# J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but that is marked fixed in 2.0.0

was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error] SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
# J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but that is marked fixed in 2.0.0

> SIGSEGV in GeneratedIterator.sort_addToSorter
> ---------------------------------------------
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
> Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And we partition/sort them all beforehand so it's always sort-merge-joins or broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime
[jira] [Created] (SPARK-20118) spark2.1 support numeric datatype
QQShu1 created SPARK-20118:
------------------------------

Summary: spark2.1 support numeric datatype
Key: SPARK-20118
URL: https://issues.apache.org/jira/browse/SPARK-20118
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.1.0
Reporter: QQShu1

Spark 2.1 does not currently support numeric.
[jira] [Closed] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[ https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jianran.tfh closed SPARK-20117.
-------------------------------

Resolution: Invalid

> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jianran.tfh
> Priority: Trivial
>
> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[jira] [Updated] (SPARK-19757) Executor with task scheduled could be killed due to idleness
[ https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid updated SPARK-19757:
---------------------------------

Component/s: Scheduler

> Executor with task scheduled could be killed due to idleness
> ------------------------------------------------------------
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 1.6.0
> Reporter: jin xing
> Assignee: Jimmy Xiang
> Priority: Minor
> Fix For: 2.2.0
>
> With dynamic executor allocation enabled in YARN mode, if another job is submitted a while after one job has finished, there is a race between killing idle executors and scheduling new tasks on those executors. Sometimes an executor is killed right after a task is scheduled on it.
[jira] [Commented] (SPARK-20070) Redact datasource explain output
[ https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944417#comment-15944417 ]

Apache Spark commented on SPARK-20070:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17448

> Redact datasource explain output
> --------------------------------
>
> Key: SPARK-20070
> URL: https://issues.apache.org/jira/browse/SPARK-20070
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Herman van Hovell
> Assignee: Herman van Hovell
> Fix For: 2.2.0
>
> When calling explain on a datasource, the output can contain sensitive information. We should provide a way for an admin/user to redact such information.
[jira] [Assigned] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[ https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20117:
------------------------------------

Assignee: Apache Spark

> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jianran.tfh
> Assignee: Apache Spark
> Priority: Trivial
>
> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[jira] [Assigned] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[ https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20117:
------------------------------------

Assignee: (was: Apache Spark)

> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jianran.tfh
> Priority: Trivial
>
> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[jira] [Commented] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[ https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944410#comment-15944410 ]

Apache Spark commented on SPARK-20117:
--------------------------------------

User 'jianran' has created a pull request for this issue:
https://github.com/apache/spark/pull/17447

> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jianran.tfh
> Priority: Trivial
>
> TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[jira] [Created] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
jianran.tfh created SPARK-20117:
-----------------------------------

Summary: TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
Key: SPARK-20117
URL: https://issues.apache.org/jira/browse/SPARK-20117
Project: Spark
Issue Type: Improvement
Components: Scheduler
Affects Versions: 2.1.0
Reporter: jianran.tfh
Priority: Trivial

TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation
[jira] [Resolved] (SPARK-19088) Optimize sequence type deserialization codegen
[ https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19088.
---------------------------------

Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16541
[https://github.com/apache/spark/pull/16541]

> Optimize sequence type deserialization codegen
> ----------------------------------------------
>
> Key: SPARK-19088
> URL: https://issues.apache.org/jira/browse/SPARK-19088
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michal Šenkýř
> Assignee: Michal Šenkýř
> Priority: Minor
> Labels: performance
> Fix For: 2.2.0
>
> Sequence type deserialization codegen added in [PR #16240|https://github.com/apache/spark/pull/16240] should use a proper builder instead of a conversion (using {{to}}) to avoid an additional pass.
> This will require an additional {{MapObjects}}-like operation that will use the provided builder instead of building an array.
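The difference between converting an intermediate array and appending straight into a builder, as proposed in SPARK-19088, can be sketched with the standard Scala collections. This is an illustration only and not the generated code from the pull request: `BuilderSketch`, `viaConversion`, and `viaBuilder` are hypothetical names, and the `+ 1` stands in for whatever per-element deserialization the real codegen performs.

```scala
object BuilderSketch {
  // Builds an intermediate array first, then converts it to the target
  // collection, which walks the elements a second time.
  def viaConversion(input: Seq[Int]): List[Int] = {
    val intermediate: Array[Int] = input.iterator.map(_ + 1).toArray
    intermediate.toList
  }

  // Appends each transformed element directly into the target
  // collection's builder, so no intermediate array is created.
  def viaBuilder(input: Seq[Int]): List[Int] = {
    val builder = List.newBuilder[Int]
    builder.sizeHint(input.size)
    input.foreach(x => builder += (x + 1))
    builder.result()
  }
}
```

Both functions return the same list; the builder version simply avoids the extra pass and the temporary array.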
[jira] [Assigned] (SPARK-19088) Optimize sequence type deserialization codegen
[ https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-19088:
-----------------------------------

Assignee: Michal Šenkýř

> Optimize sequence type deserialization codegen
> ----------------------------------------------
>
> Key: SPARK-19088
> URL: https://issues.apache.org/jira/browse/SPARK-19088
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michal Šenkýř
> Assignee: Michal Šenkýř
> Priority: Minor
> Labels: performance
>
> Sequence type deserialization codegen added in [PR #16240|https://github.com/apache/spark/pull/16240] should use a proper builder instead of a conversion (using {{to}}) to avoid an additional pass.
> This will require an additional {{MapObjects}}-like operation that will use the provided builder instead of building an array.
[jira] [Resolved] (SPARK-20100) Consolidate SessionState construction
[ https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-20100.
---------------------------------

Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17433
[https://github.com/apache/spark/pull/17433]

> Consolidate SessionState construction
> -------------------------------------
>
> Key: SPARK-20100
> URL: https://issues.apache.org/jira/browse/SPARK-20100
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Herman van Hovell
> Assignee: Herman van Hovell
> Fix For: 2.2.0
>
> The current SessionState initialization path is quite complex. A part of the creation is done in the SessionState companion objects, a part of the creation is done inside the SessionState class, and a part is done by passing functions.
> The proposal is to consolidate the SessionState initialization into a builder class. This SessionState will not do any initialization and just becomes a placeholder for the various Spark SQL internals. The advantages of this approach are the following:
> - SessionState initialization is less dispersed. The builder should be a one-stop shop.
> - This provides us with a start for removing the HiveSessionState. Removing the hive session state would also require us to move resource loading into a separate class, and to (re)move metadata hive.
> - It is easier to customize the Spark Session. You just need to create a custom version of the builder. I will add hooks to make this easier. Opening up these APIs will happen at a later point.
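The shape of the proposal in SPARK-20100 (a state object that does no initialization, plus a builder that owns all of it and can be subclassed) can be sketched as below. Every name here (`SessionStateBuilderSketch`, `State`, `BaseBuilder`, `CustomBuilder`, and the field names) is hypothetical and chosen for illustration; none of this is taken from the pull request or from Spark's actual classes.

```scala
object SessionStateBuilderSketch {
  // Plain holder: carries the pieces but contains no initialization logic.
  final case class State(
      conf: Map[String, String],
      catalogName: String,
      analyzerRules: Seq[String])

  // All construction logic lives in the builder. Each piece is an
  // overridable hook, so customization means subclassing the builder.
  class BaseBuilder(baseConf: Map[String, String]) {
    protected def conf: Map[String, String] = baseConf
    protected def catalogName: String = "in-memory"
    protected def analyzerRules: Seq[String] = Seq("base-rule")

    def build(): State = State(conf, catalogName, analyzerRules)
  }

  // A customized session only overrides the hooks it cares about.
  class CustomBuilder(baseConf: Map[String, String]) extends BaseBuilder(baseConf) {
    override protected def catalogName: String = "custom"
    override protected def analyzerRules: Seq[String] =
      super.analyzerRules :+ "custom-rule"
  }
}
```

With this layout, `new CustomBuilder(conf).build()` yields a `State` whose catalog and rules differ from the base while the holder class itself stays unchanged.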
[jira] [Created] (SPARK-20116) Remove task-level functionality from the DAGScheduler
Kay Ousterhout created SPARK-20116:
--------------------------------------

Summary: Remove task-level functionality from the DAGScheduler
Key: SPARK-20116
URL: https://issues.apache.org/jira/browse/SPARK-20116
Project: Spark
Issue Type: Sub-task
Components: Scheduler
Affects Versions: 2.2.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout

Long, long ago, the scheduler code was more modular, and the DAGScheduler handled the logic of scheduling DAGs of stages (as the name suggests) and the TaskSchedulerImpl handled scheduling the tasks within a stage. Over time, more and more task-specific functionality has been added to the DAGScheduler, and now, the DAGScheduler duplicates a bunch of the task tracking that's done by other scheduler components. This makes the scheduler code harder to reason about, and has led to some tricky bugs (e.g., SPARK-19263). We should move all of this functionality back to the TaskSchedulerImpl and TaskSetManager, which should "hide" that complexity from the DAGScheduler.
[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19803.
---------------------------------

Resolution: Fixed

Issue resolved by pull request 17325
[https://github.com/apache/spark/pull/17325]

> Flaky BlockManagerProactiveReplicationSuite tests
> -------------------------------------------------
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 2.2.0
> Reporter: Sital Kedia
> Assignee: Shubham Chopra
> Labels: flaky-test
> Fix For: 2.2.0
>
> The tests added for BlockManagerProactiveReplicationSuite have made the Jenkins build flaky. Please refer to the build for more details -
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/
[jira] [Commented] (SPARK-17075) Cardinality Estimation of Predicate Expressions
[ https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944325#comment-15944325 ]

Apache Spark commented on SPARK-17075:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17446

> Cardinality Estimation of Predicate Expressions
> -----------------------------------------------
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu
> Assignee: Ron Hu
> Fix For: 2.2.0
>
> A filter condition is the predicate expression specified in the WHERE clause of a SQL select statement. A predicate can be a compound logical expression with logical AND, OR, NOT operators combining multiple single conditions. A single condition usually has comparison operators such as =, <, <=, >, >=, ‘like’, etc.
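One common way to estimate the cardinality of a compound predicate like the one described in SPARK-17075 is to combine per-condition selectivities under an independence assumption. The sketch below shows that textbook approach; it is an assumption made for illustration and may differ from what the linked pull request implements. `SelectivitySketch` and all of its members are hypothetical names, and the selectivity of each single condition is taken as a given input rather than derived from column statistics.

```scala
object SelectivitySketch {
  sealed trait Predicate
  // A single condition (e.g. a = 1) with an already-known selectivity in [0, 1].
  final case class Single(selectivity: Double) extends Predicate
  final case class And(left: Predicate, right: Predicate) extends Predicate
  final case class Or(left: Predicate, right: Predicate) extends Predicate
  final case class Not(child: Predicate) extends Predicate

  // Fraction of rows expected to pass, assuming conditions are independent.
  def selectivity(p: Predicate): Double = p match {
    case Single(s) => s
    case And(l, r) => selectivity(l) * selectivity(r)
    case Or(l, r) =>
      val a = selectivity(l)
      val b = selectivity(r)
      a + b - a * b
    case Not(c) => 1.0 - selectivity(c)
  }

  // Estimated output row count of a filter over inputRows rows.
  def estimateRows(inputRows: Long, p: Predicate): Long =
    math.round(inputRows * selectivity(p))
}
```

The independence assumption is what makes AND a product and OR an inclusion-exclusion sum; correlated columns would make these estimates too low or too high.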
[jira] [Assigned] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable
[ https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20115: Assignee: Apache Spark > Fix DAGScheduler to recompute all the lost shuffle blocks when external > shuffle service is unavailable > -- > > Key: SPARK-20115 > URL: https://issues.apache.org/jira/browse/SPARK-20115 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core, YARN >Affects Versions: 2.0.2, 2.1.0 > Environment: Spark on Yarn with external shuffle service enabled, > running on AWS EMR cluster. >Reporter: Udit Mehrotra >Assignee: Apache Spark > > The Spark’s DAGScheduler currently does not recompute all the lost shuffle > blocks on a host when a FetchFailed exception occurs, while fetching shuffle > blocks from another executor with external shuffle service enabled. Instead > it only recomputes the lost shuffle blocks computed by the executor for which > the FetchFailed exception occurred. This works fine for Internal shuffle > scenario, where the executors serve their own shuffle blocks and hence only > the shuffle blocks for that executor should be considered lost. However, when > External Shuffle Service is being used, a FetchFailed exception would mean > that the external shuffle service running on that host has become > unavailable. This in turn is sufficient to assume that all the shuffle blocks > which were managed by the Shuffle service on that host are lost. Therefore, > just recomputing the shuffle blocks associated with the particular Executor > for which FetchFailed exception occurred is not sufficient. We need to > recompute all the shuffle blocks, managed by that service because there could > be multiple executors running on that host. 
> > Since not all the shuffle blocks (for all the executors on the host) are > recomputed, this causes future attempts of the reduce stage to fail as well > because the new tasks scheduled still keep trying to reach the old location > of the shuffle blocks (which were not recomputed) and keep throwing further > FetchFailed exceptions. This ultimately causes the job to fail, after the > reduce stage has been retried 4 times. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable
[ https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944312#comment-15944312 ] Apache Spark commented on SPARK-20115: -- User 'umehrot2' has created a pull request for this issue: https://github.com/apache/spark/pull/17445 > Fix DAGScheduler to recompute all the lost shuffle blocks when external > shuffle service is unavailable > -- > > Key: SPARK-20115 > URL: https://issues.apache.org/jira/browse/SPARK-20115 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core, YARN >Affects Versions: 2.0.2, 2.1.0 > Environment: Spark on Yarn with external shuffle service enabled, > running on AWS EMR cluster. >Reporter: Udit Mehrotra
[jira] [Assigned] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable
[ https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20115: Assignee: (was: Apache Spark) > Fix DAGScheduler to recompute all the lost shuffle blocks when external > shuffle service is unavailable > -- > > Key: SPARK-20115 > URL: https://issues.apache.org/jira/browse/SPARK-20115 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core, YARN >Affects Versions: 2.0.2, 2.1.0 > Environment: Spark on Yarn with external shuffle service enabled, > running on AWS EMR cluster. >Reporter: Udit Mehrotra
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mitesh updated SPARK-20112:
---
Description:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen files are attached (with query plans). It's not a deterministic repro, but when running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
# J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}
This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but that is marked as fixed in 2.0.0.
was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen files are attached (with query plans). It's not a deterministic repro, but when running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
# J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}
> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
> Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
[jira] [Issue Comment Deleted] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: --- Comment: was deleted (was: This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but that is marked fix in 2.0.0) > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
[jira] [Created] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable
Udit Mehrotra created SPARK-20115: - Summary: Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable Key: SPARK-20115 URL: https://issues.apache.org/jira/browse/SPARK-20115 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core, YARN Affects Versions: 2.1.0, 2.0.2 Environment: Spark on Yarn with external shuffle service enabled, running on AWS EMR cluster. Reporter: Udit Mehrotra
Spark's DAGScheduler currently does not recompute all the lost shuffle blocks on a host when a FetchFailed exception occurs while fetching shuffle blocks from another executor with the external shuffle service enabled. Instead, it only recomputes the lost shuffle blocks computed by the executor for which the FetchFailed exception occurred. This works fine for the internal shuffle scenario, where the executors serve their own shuffle blocks and hence only the shuffle blocks for that executor should be considered lost. However, when the external shuffle service is being used, a FetchFailed exception would mean that the external shuffle service running on that host has become unavailable. This in turn is sufficient to assume that all the shuffle blocks managed by the shuffle service on that host are lost. Therefore, just recomputing the shuffle blocks associated with the particular executor for which the FetchFailed exception occurred is not sufficient. We need to recompute all the shuffle blocks managed by that service, because there could be multiple executors running on that host.
Since not all the shuffle blocks (for all the executors on the host) are recomputed, future attempts of the reduce stage fail as well, because the newly scheduled tasks keep trying to reach the old location of the shuffle blocks (which were not recomputed) and keep throwing further FetchFailed exceptions. This ultimately causes the job to fail after the reduce stage has been retried 4 times.
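The difference between the two invalidation policies described in this report can be sketched with a toy model. This is not Spark's DAGScheduler or MapOutputTracker code; the `MapOutput` record, the `lost_outputs` function, and the executor and host names are hypothetical and only illustrate which map outputs would be treated as lost after a fetch failure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MapOutput:
    """Location of one map task's shuffle output (toy model)."""
    map_id: int
    executor_id: str
    host: str


def lost_outputs(
    outputs: List[MapOutput],
    failed_executor: str,
    failed_host: str,
    external_shuffle_service: bool,
) -> List[MapOutput]:
    """Return the map outputs to treat as lost after a fetch failure.

    With the external shuffle service, one service per host serves the blocks
    of every executor on that host, so a failed fetch is taken to mean that
    all outputs on the host are unreachable. Without it, each executor serves
    its own blocks, so only that executor's outputs are considered lost.
    """
    if external_shuffle_service:
        return [o for o in outputs if o.host == failed_host]
    return [o for o in outputs if o.executor_id == failed_executor]


outputs = [
    MapOutput(0, "exec-1", "host-a"),
    MapOutput(1, "exec-2", "host-a"),
    MapOutput(2, "exec-3", "host-b"),
]

# A fetch from exec-1 on host-a failed.
per_executor = lost_outputs(outputs, "exec-1", "host-a", external_shuffle_service=False)
per_host = lost_outputs(outputs, "exec-1", "host-a", external_shuffle_service=True)
print([o.map_id for o in per_executor])  # prints [0]
print([o.map_id for o in per_host])      # prints [0, 1]
```

In this model, the per-executor policy leaves map output 1 registered at host-a even though the service there is unreachable, which corresponds to the repeated FetchFailed exceptions the report describes.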
[jira] [Comment Edited] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239 ] yuhao yang edited comment on SPARK-20114 at 3/27/17 11:42 PM: -- Currently I prefer to implement the dummy PrefixSpanModel as the sequential rules extracted won't be quite useful. And if needed, we can implement other algorithms to extract sequential rules for prediction. was (Author: yuhaoyan): Currently I prefer to implement the dummy PrefixSpanModel as the sequential rules extracted won't be quite useful. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. 
But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of Sequential rules > are more practical. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239 ] yuhao yang commented on SPARK-20114: Currently I prefer to implement the dummy PrefixSpanModel as the sequential rules extracted won't be quite useful. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20114: --- Description: Creating this jira to track the feature parity for PrefixSpan and sequential pattern mining in Spark ml with DataFrame API. First list a few design issues to be discussed, then subtasks like Scala, Python and R API will be created. # Wrapping the MLlib PrefixSpan and provide a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which is not good to be used directly for predicting on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, options are: #* Implement a dummy transform for PrefixSpanModel, which will not add new column to the input DataSet. The PrefixSpanModel is only used to provide access for frequent sequential patterns. #* Adding the feature to extract sequential rules from sequential patterns. Then use the sequential rules in the transform as FPGrowthModel. The rules extracted are of the form X–> Y where X and Y are sequential patterns. But in practice, these rules are not very good as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of Sequential rules are more practical. was: Creating this jira to track the feature parity for PrefixSpan and sequential pattern mining in Spark ml with DataFrame API. 
First list a few design issues to be discussed, then subtasks like Scala, Python and R API will be created. # Wrapping the MLlib PrefixSpan and provide a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which is not good to be used directly for predicting on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, options are: #* Implement a dummy transform for PrefixSpanModel, which will not add new column to the input DataSet. #* Adding the feature to extract sequential rules from sequential patterns. Then use the sequential rules in the transform as FPGrowthModel. The rules extracted are of the form X–> Y where X and Y are sequential patterns. But in practice, these rules are not very good as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of Sequential rules are more practical. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. 
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel.
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20114: --- Description: Creating this jira to track the feature parity for PrefixSpan and sequential pattern mining in Spark ml with DataFrame API. First list a few design issues to be discussed, then subtasks like Scala, Python and R API will be created. # Wrapping the MLlib PrefixSpan and provide a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which is not good to be used directly for predicting on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, options are: #* Implement a dummy transform for PrefixSpanModel, which will not add new column to the input DataSet. #* Adding the feature to extract sequential rules from sequential patterns. Then use the sequential rules in the transform as FPGrowthModel. The rules extracted are of the form X–> Y where X and Y are sequential patterns. But in practice, these rules are not very good as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of Sequential rules are more practical. was: Creating this jira to track the feature parity for PrefixSpan and sequential pattern mining in Spark ml with DataFrame API. First list a few design issues to be discussed, then subtasks like Scala, Python and R will be created. 
# Wrapping the MLlib PrefixSpan and provide a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which is not good to be used directly for predicting on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, options are: #* Implement a dummy transform for PrefixSpanModel, which will not add new column to the input DataSet. #* Adding the feature to extract sequential rules from sequential patterns. Then use the sequential rules in the transform as FPGrowthModel. The rules extracted are of the form X–> Y where X and Y are sequential patterns. But in practice, these rules are not very good as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of Sequential rules are more practical. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. 
Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not
[jira] [Created] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
yuhao yang created SPARK-20114: -- Summary: spark.ml parity for sequential pattern mining - PrefixSpan Key: SPARK-20114 URL: https://issues.apache.org/jira/browse/SPARK-20114 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Creating this JIRA to track feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. A few design issues are listed first for discussion; subtasks for the Scala, Python and R APIs will be created afterwards. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. However, PrefixSpan only extracts frequent sequential patterns, which are not suitable for predicting directly on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel that does not add a new column to the input Dataset. #* Add a feature to extract sequential rules from the sequential patterns, then use those rules in transform, as FPGrowthModel does. The extracted rules have the form X -> Y, where X and Y are sequential patterns. In practice these rules are not very good, because they are too precise and thus not noise tolerant. # Unlike association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth and ERMiner. The rules have the form X -> Y, where X and Y are each unordered but X must appear before Y; this is more general and can work better in practice for prediction. I'd like to hear more from users about which kind of sequential rules is more practical.
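To make the second kind of rule concrete, here is a minimal sketch in plain Scala (no Spark dependency) of what it means for a rule X -> Y to occur in a sequence when X and Y are each unordered but X must appear before Y. The object name `SequentialRuleSketch`, its method names, and the support/confidence definitions are illustrative assumptions, not part of MLlib or of RuleGrowth/ERMiner as published.

```scala
object SequentialRuleSketch {
  // A sequence is an ordered list of itemsets (hypothetical representation).
  type Sequence = Seq[Set[String]]

  // All items that appear anywhere in the given itemsets.
  private def items(s: Sequence): Set[String] = s.flatten.toSet

  /** True if there is a split point such that every item of x occurs before it
    * and every item of y occurs after it. Order inside x and inside y is ignored. */
  def occursIn(x: Set[String], y: Set[String], seq: Sequence): Boolean =
    (1 until seq.length).exists { k =>
      val (before, after) = seq.splitAt(k)
      x.subsetOf(items(before)) && y.subsetOf(items(after))
    }

  /** Fraction of sequences in a non-empty database in which the rule occurs. */
  def support(x: Set[String], y: Set[String], db: Seq[Sequence]): Double =
    db.count(occursIn(x, y, _)).toDouble / db.size

  /** Of the sequences that contain all of x, the fraction in which the rule occurs. */
  def confidence(x: Set[String], y: Set[String], db: Seq[Sequence]): Double = {
    val withX = db.count(s => x.subsetOf(items(s)))
    if (withX == 0) 0.0 else db.count(occursIn(x, y, _)).toDouble / withX
  }
}
```

With this definition the sequences (a)(b)(c) and (b)(a)(c) both support {a, b} -> {c}, which is the noise tolerance the description refers to; a rule built from ordered sequential patterns would match only one of them.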
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944215#comment-15944215 ] Timothy Hunter commented on SPARK-19634: [~sethah], yes, thanks for bringing up these concerns. Regarding the first points, the UDAF interface does not let you update arrays in place, which is a non-starter in our case. This is why the implementation switched to {{TypedImperativeAggregate}} (TIA). I have updated the design doc with these comments. Regarding performance, I agree that there is a tension between having an API that is compatible with structured streaming and the current RDD-based implementation. I will provide some test numbers so that we have a basis for discussion. That being said, the RDD API is not going away, so users who care about performance and do not need the additional benefit of integrating with SQL or structured streaming can still use it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
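For readers unfamiliar with why in-place array updates matter here, the following is a rough sketch in plain Scala of an online summarizer that keeps per-column running statistics in mutable arrays and merges partial results. It is not the code of {{MultivariateOnlineSummarizer}} or of the PR; the class `Summary` and its fields are hypothetical, and the update rule is the standard Welford/Chan running-variance formula.

```scala
object OnlineSummarizerSketch {
  /** Per-column running mean and sum of squared deviations (m2), updated in place. */
  final class Summary(val dim: Int) {
    var count: Long = 0L
    val mean: Array[Double] = new Array[Double](dim)
    val m2: Array[Double] = new Array[Double](dim)

    /** Folds one row into the buffers without allocating new arrays. */
    def add(row: Array[Double]): this.type = {
      require(row.length == dim, "row has the wrong number of columns")
      count += 1
      var i = 0
      while (i < dim) {
        val delta = row(i) - mean(i)
        mean(i) += delta / count
        m2(i) += delta * (row(i) - mean(i))
        i += 1
      }
      this
    }

    /** Combines another partial summary into this one (Chan et al. pairwise update). */
    def merge(other: Summary): this.type = {
      require(other.dim == dim, "summaries have different dimensions")
      if (other.count > 0) {
        val total = count + other.count
        var i = 0
        while (i < dim) {
          val delta = other.mean(i) - mean(i)
          m2(i) += other.m2(i) + delta * delta * count * other.count / total
          mean(i) += delta * other.count / total
          i += 1
        }
        count = total
      }
      this
    }

    /** Unbiased sample variance per column; 0.0 when fewer than two rows were seen. */
    def variance: Array[Double] =
      m2.map(v => if (count > 1) v / (count - 1) else 0.0)
  }
}
```

An aggregation interface that forces a fresh buffer on every update would copy both arrays once per row, which is the cost the comment above calls a non-starter.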
[jira] [Commented] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944181#comment-15944181 ] Kay Ousterhout commented on SPARK-19612: And another: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75272/console > Tests failing with timeout > -- > > Key: SPARK-19612 > URL: https://issues.apache.org/jira/browse/SPARK-19612 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout >Priority: Minor > > I've seen at least one recent test failure due to hitting the 250m timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ > Filing this JIRA to track this; if it happens repeatedly we should up the > timeout. > cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944163#comment-15944163 ] Lucy Yu commented on SPARK-19476: - bq. I don't think in general you're expected to be able to do this safely. Why would you do this asynchronously or with more partitions, simply? Sorry, my simplified example had a mistake in it, but a user did run into a NullPointerException in our actual code. In my simplified example the lambda function may return before the thread is complete, but the actual code ensures that the lambda function cannot return until the thread has finished, i.e. we have {code} df.foreachPartition(partition => { ... numRowsAccumulator += ingestStrategy.loadPartition(targetNode, partition) }) {code} and loadPartition is defined at https://github.com/memsql/memsql-spark-connector/blob/master/src/main/scala/com/memsql/spark/connector/LoadDataStrategy.scala#L18 . Basically, the thread finishes once it has read all of the partition's data into a stream, at which point it closes the stream, and stmt.executeUpdate(query.sql.toString), which is part of the function passed to foreachPartition, waits until the stream is closed. We do this to load the partition into a database in a constant-memory way, by writing to a pipe and consuming from it at the same time. Without this approach, users may run out of memory materializing the partition. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper >Priority: Minor > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. 
> I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
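The mechanism the reporter suspects can be shown without Spark. The sketch below is plain JVM behaviour, not Spark's code: a value stored in a `ThreadLocal` by one thread reads as null from any other thread. The object `ThreadLocalSketch` and the `taskState` variable are hypothetical stand-ins for whatever per-task state the iterator relies on.

```scala
object ThreadLocalSketch {
  // Stand-in for per-task state that is only ever set on the thread running the task.
  private val taskState = new ThreadLocal[String]()

  /** Returns (value seen by the owning thread, value seen by a second thread). */
  def observe(): (String, String) = {
    taskState.set("task-0")
    var seenByOther: String = "unset"
    val worker = new Thread(new Runnable {
      // A different thread has its own empty slot, so get() returns null here.
      override def run(): Unit = { seenByOther = taskState.get() }
    })
    worker.start()
    worker.join() // join() makes the worker's write visible to this thread
    val seenByOwner = taskState.get()
    taskState.remove()
    (seenByOwner, seenByOther)
  }
}
```

Any code that dereferences such a value on the second thread without a null check would fail in the way the stack trace above shows, assuming the reporter's reading of the iterator is right.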
[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944030#comment-15944030 ] Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM: I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as the current {{MultivariateOnlineSummarizer}} and a DataFrame interface using UDAF. From the design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? I'd prefer to get the details hashed out further rather than rushing to provide an API and an initial slow implementation; that way we can make sure that we get this right in the long term. I would really appreciate some clarification, and my apologies if I have missed any of the details/discussion. was (Author: sethah): I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as the current {{MultivariateOnlineSummarizer}} and a DataFrame interface using UDAF. 
From the design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? If this is still targeted at 2.2, why? I'd prefer to get the details hashed out further rather than rushing to provide an API and an initial slow implementation; that way we can make sure that we get this right in the long term. I would really appreciate some clarification, and my apologies if I have missed any of the details/discussion. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
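As a point of reference for the "tree-reduce type aggregation" mentioned above, here is a small sketch in plain Scala of the idea: partial results are combined pairwise, level by level, rather than folded into a single accumulator one at a time. This only illustrates the shape of the computation; the name `treeReduce` here is a local helper, and nothing below says how Catalyst would or would not implement it.

```scala
object TreeReduceSketch {
  /** Combines partial results pairwise in rounds until one remains.
    * With n partials this takes about log2(n) rounds instead of n - 1 sequential steps. */
  def treeReduce[A](partials: Seq[A])(combine: (A, A) => A): A = {
    require(partials.nonEmpty, "need at least one partial result")
    var level: Vector[A] = partials.toVector
    while (level.size > 1) {
      level = level.grouped(2).map { pair =>
        // An odd element at the end of a round is carried up unchanged.
        if (pair.size == 2) combine(pair(0), pair(1)) else pair(0)
      }.toVector
    }
    level.head
  }
}
```

The combine function must be associative for the result to match a sequential fold; merging partial summaries of disjoint data usually satisfies that.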
[jira] [Resolved] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20111. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley > Fix For: 2.1.1, 2.2.0 > > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944113#comment-15944113 ] Joseph K. Bradley edited comment on SPARK-20111 at 3/27/17 9:59 PM: Yep, you're right. Closing now. Thanks! was (Author: josephkb): Yep, you're right. Is the fix worth backporting, or would it be too much trouble? > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley > Fix For: 2.1.1, 2.2.0 > > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944113#comment-15944113 ] Joseph K. Bradley commented on SPARK-20111: --- Yep, you're right. Is the fix worth backporting, or would it be too much trouble? > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944100#comment-15944100 ] Mitesh commented on SPARK-20112: This looks similar to https://issues.apache.org/jira/browse/SPARK-15822, but that issue is marked as fixed in 2.0.0. > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but when running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. 
> - As you can see from the plan, all the sources are cached memory tables > {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: --- Attachment: codegen_sorter_crash.log > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but when running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. 
> - As you can see from the plan, all the sources are cached memory tables > {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944094#comment-15944094 ] Morten Hornbech commented on SPARK-14560: - No. We worried it might just trigger some other bad behaviour, and it wasn't a production issue. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. > Note that an additional improvement would be for the shuffle-read side to > simple release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. 
> *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for future versions of Spark.
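The 2, 4, 8, 16 MB example in the description can be traced with a toy model. The sketch below is plain Scala and deliberately simplified: a single reader doubles its reservation until it holds at least what it needs, and each request is capped by what is left in the pool. The function `simulate` and its parameters are hypothetical; the real allocation logic in {{Spillable}} and {{TaskMemoryManager}} is more involved.

```scala
object DoublingRequestSketch {
  /** Returns (MB held by the reader, MB left free) once the reader holds at least
    * neededMb or the pool is exhausted. All sizes are whole megabytes. */
  def simulate(poolMb: Int, startMb: Int, neededMb: Int): (Int, Int) = {
    require(startMb > 0 && poolMb >= startMb, "reader must start with a positive reservation")
    var held = startMb
    var free = poolMb - held
    while (held < neededMb && free > 0) {
      val target = held * 2                       // the reader asks to double its reservation
      val granted = math.min(target - held, free) // the pool grants at most what is free
      held += granted
      free -= granted
    }
    (held, free)
  }
}
```

With a 15 MB pool, a 2 MB start and a 10 MB need, the reader ends up holding all 15 MB: it asked to grow to 16 MB and was granted everything that was left, so 5 MB sits unused while the shuffle-write side has nothing to acquire.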
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: --- Attachment: (was: codegen_sorter_crash.log) > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: hs_err_pid19271.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but when running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. 
> - As you can see from the plan, all the sources are cached memory tables > {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944090#comment-15944090 ] Darshan Mehta commented on SPARK-14560: --- [~mhornbech] thanks for the prompt response. Before rewriting the job, did you see any difference by playing around with spark.shuffle.spill.numElementsForceSpillThreshold=N values? > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. 
> Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However, that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for future versions of Spark.
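The doubling example in the quoted description (10 MB needed, 15 MB available, starting at 2 MB) can be traced with a small simulation. This is only a sketch under stated assumptions: the function name, its parameters, and the accounting (the 15 MB pool includes the initial 2 MB, and each request asks to double the current holding) are illustrative and are not taken from Spark's actual {{Spillable}} code.

```python
def simulate_doubling(needed_mb, pool_mb, start_mb=2):
    """Toy model of a consumer that doubles its reservation until it has enough.

    Assumption (not Spark's real implementation): the consumer asks for enough
    memory to double what it currently holds, and the pool grants whatever
    is still free, up to the requested amount.
    """
    held = min(start_mb, pool_mb)   # memory currently reserved by the reader
    free = pool_mb - held           # memory still available to other consumers
    targets = []                    # the sizes the reader tried to grow to
    while held < needed_mb and free > 0:
        target = held * 2
        grant = min(target - held, free)
        targets.append(target)
        held += grant
        free -= grant
    return targets, held, free


targets, held, free = simulate_doubling(needed_mb=10, pool_mb=15)
print(targets, held, free)  # [4, 8, 16] 15 0
```

In this model the reader ends up holding all 15 MB although it needs only 10 MB, so nothing is left for the shuffle-write side, which matches the starvation described above.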
[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944089#comment-15944089 ] Herman van Hovell commented on SPARK-20111: --- [~josephkb] this might be already fixed in the latest master/2.1 (see SPARK-19512). Did you try this on the latest master? > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key
[ https://issues.apache.org/jira/browse/SPARK-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944088#comment-15944088 ] Bhanu Akaveeti commented on SPARK-20113: I was expecting an entire-row comparison. Is there a provision, in releases later than 2.0.1, to specify a key (for comparison) in the script? Also, truncate does not seem to work on a table that does not have a PK.
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944080#comment-15944080 ] Morten Hornbech commented on SPARK-14560: No, not really. We were able to work around it by rewriting the job, but it was never clear what made the difference.
[jira] [Commented] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key
[ https://issues.apache.org/jira/browse/SPARK-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944079#comment-15944079 ] Sean Owen commented on SPARK-20113: If there is no primary key, how would anything know that the data is already inserted? There is no notion of sameness by which to decide that the data is already there.
[jira] [Created] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key
Bhanu Akaveeti created SPARK-20113:

Summary: overwrite mode appends data on MySQL table that does not have a primary key
Key: SPARK-20113
URL: https://issues.apache.org/jira/browse/SPARK-20113
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.0.1
Reporter: Bhanu Akaveeti

Dataframe.write in overwrite mode appends data on MySQL table that does not have a primary key

df_mysql.write \
    .mode("overwrite") \
    .jdbc("jdbc:mysql://ip-address/database", "MySQL_Table",
          properties={"user": "MySQL_user", "password": "MySQL_pw"})

When the above script is run twice, data is inserted twice. Also, I tried with option("truncate","true") but still data is appended in MySQL table
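For reference, the write from the report can be wrapped in a small helper that also sets the truncate option mentioned above. This is a hedged sketch, not a fix for the issue: the helper name and its parameters are hypothetical, and it only chains the standard DataFrameWriter calls ({{mode}}, {{option}}, {{jdbc}}). To my understanding the JDBC {{truncate}} option is documented from Spark 2.1.0 onward, so it may have no effect on 2.0.1; that is an assumption to verify against the documentation for the version in use.

```python
def overwrite_jdbc_table(df, url, table, user, password, truncate=True):
    """Write `df` to `table` over JDBC in overwrite mode.

    `df` is expected to be a pyspark.sql.DataFrame; the helper itself is
    hypothetical and only chains DataFrameWriter calls.
    """
    writer = df.write.mode("overwrite")
    if truncate:
        # Ask the JDBC writer to truncate the existing table rather than
        # drop and recreate it (assumed to require Spark 2.1.0 or later).
        writer = writer.option("truncate", "true")
    writer.jdbc(url, table, properties={"user": user, "password": password})


# Usage with the placeholder values from the report:
# overwrite_jdbc_table(df_mysql, "jdbc:mysql://ip-address/database",
#                      "MySQL_Table", "MySQL_user", "MySQL_pw")
```

Overwrite mode replaces the table contents as a whole and does not compare rows, so no key is needed for it to work as intended; comparing row counts in MySQL before and after a second run is a simple way to check whether data was appended.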
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mitesh updated SPARK-20112:

Description:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
#
J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
#
J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mitesh updated SPARK-20112:

Description:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
#
J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

was:
I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes.
Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables
{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
# [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
[thread 139872348002048 also had an error]# Problematic frame:
#
J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944077#comment-15944077 ]

Darshan Mehta commented on SPARK-14560:

[~lovemylover] [~mhornbech] Were you able to figure out or fix the issue? We are using Spark 2.1.0 and are facing a similar bug; the stack trace is below:

Caused by: java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:125)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:154)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:121)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:82)
    at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:374)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:371)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: Attachment: (was: codegen_sorter_crash)
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: Attachment: codegen_sorter_crash.log
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: Attachment: hs_err_pid19271.log codegen_sorter_crash
[jira] [Created] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
Mitesh created SPARK-20112: -- Summary: SIGSEGV in GeneratedIterator.sort_addToSorter Key: SPARK-20112 URL: https://issues.apache.org/jira/browse/SPARK-20112 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) Reporter: Mitesh I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The hs_err_pid and codegen file are attached (with query plans). It's not a deterministic repro, but running a big query load, I eventually see it come up within a few minutes. Here is some interesting repro information: - Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro this. So I think that means it's not an issue with the code-gen, but I can't figure out what the difference in behavior is. - The broadcast joins in the plan are all small tables. I have autoJoinBroadcast=-1 because I always hint which tables should be broadcast. - As you can see from the plan, all the sources are cached memory tables {noformat} # A fatal error has been detected by the Java Runtime Environment: # # [thread 139872345896704 also had an error] SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) [thread 139872348002048 also had an error]# Problematic frame: # J 28454 C1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] {noformat}
[jira] [Commented] (SPARK-19876) Add OneTime trigger executor
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944073#comment-15944073 ] Apache Spark commented on SPARK-19876: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/17444 > Add OneTime trigger executor > > > Key: SPARK-19876 > URL: https://issues.apache.org/jira/browse/SPARK-19876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.2.0 > > > The goal is to add a new trigger executor that will process a single trigger > then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website
[ https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19904. --- Resolution: Fixed Fix Version/s: 2.2.0 There's not a bright line between what goes in the spark.apache.org site, and what goes in per-release spark.apache.org/docs docs. Anything release-specific must be in the latter, which suggests anything else go in the former. I think SPIPs are process items and not release-specific, so I don't necessarily see value in linking to them from release-specific docs. I'd suggest we call this done. > SPIP Add Spark Project Improvement Proposal doc to website > -- > > Key: SPARK-19904 > URL: https://issues.apache.org/jira/browse/SPARK-19904 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Cody Koeninger >Assignee: Cody Koeninger > Labels: SPIP > Fix For: 2.2.0 > > > see > http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html
[jira] [Updated] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18127: Component/s: (was: Spark Core) SQL > Add hooks and extension points to Spark > --- > > Key: SPARK-18127 > URL: https://issues.apache.org/jira/browse/SPARK-18127 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Srinath >Assignee: Herman van Hovell > > As a Spark user I want to be able to customize my spark session. I currently > want to be able to do the following things: > # I want to be able to add custom analyzer rules. This allows me to implement > my own logical constructs; an example of this could be a recursive operator. > # I want to be able to add my own analysis checks. This allows me to catch > problems with spark plans early on. An example of this can be some datasource > specific checks. > # I want to be able to add my own optimizations. This allows me to optimize > plans in different ways, for instance when you use a very different cluster > (for example a one-node X1 instance). This supersedes the current > {{spark.experimental}} methods. > # I want to be able to add my own planning strategies. This supersedes the > current {{spark.experimental}} methods. This allows me to plan my own > physical plan; an example of this would be to plan my own heavily integrated > data source (CarbonData for example). > # I want to be able to use my own customized SQL constructs. An example of > this would be supporting my own dialect, or being able to add constructs to the > current SQL language. I should not have to implement a complete parser, and > should be able to delegate to an underlying parser. > # I want to be able to track modifications and calls to the external catalog. > I want this API to be stable. This allows me to synchronize with other > systems. > This API should modify the SparkSession when the session gets started, and it > should NOT change the session in flight. 
[jira] [Commented] (SPARK-19143) API in Spark for distributing new delegation tokens (Improve delegation token handling in secure clusters)
[ https://issues.apache.org/jira/browse/SPARK-19143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944046#comment-15944046 ] Thomas Graves commented on SPARK-19143: --- Yes, I can be Shepherd. > API in Spark for distributing new delegation tokens (Improve delegation token > handling in secure clusters) > -- > > Key: SPARK-19143 > URL: https://issues.apache.org/jira/browse/SPARK-19143 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ruslan Dautkhanov > > Spin off from SPARK-14743 and comments chain in [recent comments| > https://issues.apache.org/jira/browse/SPARK-5493?focusedCommentId=15802179=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15802179] > in SPARK-5493. > Spark currently doesn't have a way of distributing new delegation tokens. > Quoting [~vanzin] from SPARK-5493 > {quote} > IIRC Livy doesn't yet support delegation token renewal. Once it reaches the > TTL, the session is unusable. > There might be ways to hack support for that without changes in Spark, but > I'd like to see a proper API in Spark for distributing new delegation tokens. > I mentioned that in SPARK-14743, but although that bug is closed, that > particular feature hasn't been implemented yet. > {quote} > Other thoughts?
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944030#comment-15944030 ] Seth Hendrickson commented on SPARK-19634: -- I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as current {{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. From the design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? If this is still targeted at 2.2, why? I'd prefer to get the details hashed out further rather than rushing to provide an API and initial slow implementation; that way we can make sure that we get this correct in the long term. I'd really appreciate some clarification, and my apologies if I have missed any of the details/discussion. 
> Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944019#comment-15944019 ] Timothy Hunter commented on SPARK-19634: [~dongjin] [~wm624] sorry it looks like I missed your comments... I pushed a PR for this feature. Please feel free to comment on the PR if you have the time. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944004#comment-15944004 ] Timothy Hunter commented on SPARK-20111: As Spark SQL is making more and more forays into code generation, I have been wondering if it would make sense to start adopting practical compiler technologies, such as generating first an intermediate representation, instead of doing string manipulation as we currently do. This is of course much beyond the scope of this particular ticket. > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20111: -- Description: In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces a SQL codegen bug. This is described in https://github.com/graphframes/graphframes/issues/165 Summary * The unit test does a simple motif query on a graph. Essentially, this means taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the (tiny) DataFrame. * The test runs, but codegen fails. See the linked GraphFrames issue for the stacktrace. To reproduce this: * Check out GraphFrames https://github.com/graphframes/graphframes * Run {{sbt assembly}} to compile it and run tests Copying [~felixcheung]'s comment from the GraphFrames issue 165: {quote} Seems like codegen bug; it looks like at least 2 issues: 1. At L472, inputadapter_value is not defined within scope 2. inputadapter_value is an InternalRow, for this statement to work {{bhj_primitiveA = inputadapter_value;}} it should be {{bhj_primitiveA = inputadapter_value.getLong(0);}} {quote} was: In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces a SQL codegen bug. This is described in https://github.com/graphframes/graphframes/issues/165 Summary * The unit test does a simple motif query on a graph. Essentially, this means taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the (tiny) DataFrame. * The test runs, but codegen fails. See the linked GraphFrames issue for the stacktrace. To reproduce this: * Check out GraphFrames https://github.com/graphframes/graphframes * Run {{sbt assembly}} to compile it and run tests > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. 
Bradley > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
Joseph K. Bradley created SPARK-20111: - Summary: codegen bug surfaced by GraphFrames issue 165 Key: SPARK-20111 URL: https://issues.apache.org/jira/browse/SPARK-20111 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.2.0 Reporter: Joseph K. Bradley In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces a SQL codegen bug. This is described in https://github.com/graphframes/graphframes/issues/165 Summary * The unit test does a simple motif query on a graph. Essentially, this means taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the (tiny) DataFrame. * The test runs, but codegen fails. See the linked GraphFrames issue for the stacktrace. To reproduce this: * Check out GraphFrames https://github.com/graphframes/graphframes * Run {{sbt assembly}} to compile it and run tests -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20110) Windowed aggregation do not work when the timestamp is a nested field
Alexis Seigneurin created SPARK-20110: - Summary: Windowed aggregation do not work when the timestamp is a nested field Key: SPARK-20110 URL: https://issues.apache.org/jira/browse/SPARK-20110 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.1.0 Reporter: Alexis Seigneurin I am loading data into a DataFrame with nested fields. I want to perform a windowed aggregation on the timestamp from a nested field: {code} .groupBy(window($"auth.sysEntryTimestamp", "2 minutes")) {code} I get the following error: {quote} org.apache.spark.sql.AnalysisException: Multiple time window expressions would result in a cartesian product of rows, therefore they are not currently not supported. {quote} This works fine if I first extract the timestamp to a separate column: {code} .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp") .groupBy( window($"sysEntryTimestamp", "2 minutes") ) {code} Please see the whole sample: - batch: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html - Structured Streaming: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html
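The two-step shape of the workaround in SPARK-20110 (first pull the nested timestamp out, then group on a 2-minute window) can be illustrated without Spark. This is only a plain-Scala sketch: `Auth`, `Event`, `windowStart` and `countPerWindow` are hypothetical names, timestamps are assumed to be epoch milliseconds, and the windows are assumed to be tumbling windows aligned to the epoch. It does not reproduce or explain the AnalysisException itself.

```scala
// Plain-Scala sketch, no Spark dependency. All names are hypothetical.
object NestedTimestampWindows {
  // Nested record, mirroring the auth.sysEntryTimestamp field in the report.
  final case class Auth(sysEntryTimestampMs: Long)
  final case class Event(auth: Auth, amount: Double)

  /** Start of the tumbling window (assumed aligned to the epoch) that contains tsMs. */
  def windowStart(tsMs: Long, windowMs: Long): Long =
    tsMs - Math.floorMod(tsMs, windowMs)

  /** Step 1: extract the nested timestamp. Step 2: group on its window and count. */
  def countPerWindow(events: Seq[Event], windowMs: Long): Map[Long, Int] =
    events
      .map(e => e.auth.sysEntryTimestampMs)        // extraction into a flat value
      .groupBy(ts => windowStart(ts, windowMs))    // windowed grouping
      .map { case (start, tss) => start -> tss.size }
}
```

With a 120000 ms window, events at 0, 119999, 120000 and 250000 ms fall into the windows starting at 0, 120000 and 240000 ms.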
[jira] [Commented] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Block
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943966#comment-15943966 ] John Compitello commented on SPARK-20109: - I have a PR for this issue in the works, I'd like to be the one to handle it. > Need a way to convert from IndexedRowMatrix to Block > > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Block
John Compitello created SPARK-20109: --- Summary: Need a way to convert from IndexedRowMatrix to Block Key: SPARK-20109 URL: https://issues.apache.org/jira/browse/SPARK-20109 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.1.0 Reporter: John Compitello The current implementation of toBlockMatrix on IndexedRowMatrix is insufficient. It is implemented by first converting the IndexedRowMatrix to a CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not only is this slower than it needs to be, it also means that the created BlockMatrix ends up being backed by instances of SparseMatrix, which a user may not want. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
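The direct conversion that SPARK-20109 asks for, going from indexed rows straight to dense blocks without the intermediate coordinate form, might look roughly like the following. This is a plain-Scala sketch under stated assumptions, not the MLlib implementation or the reporter's PR: `IndexedRow`, `DenseBlock` and `toDenseBlocks` are hypothetical stand-ins, everything runs on local collections, and blocks are assumed to be stored in column-major order.

```scala
// Plain-Scala sketch, no Spark dependency. These types are hypothetical
// stand-ins, not the Spark MLlib classes of similar names.
object DirectBlockConversion {
  // A row index plus the dense values of that row.
  final case class IndexedRow(index: Long, values: Array[Double])

  // A dense block whose values are laid out in column-major order.
  final case class DenseBlock(numRows: Int, numCols: Int, values: Array[Double])

  /** Groups rows directly into dense blocks keyed by (blockRow, blockCol). */
  def toDenseBlocks(
      rows: Seq[IndexedRow],
      numCols: Int,
      rowsPerBlock: Int,
      colsPerBlock: Int): Map[(Int, Int), DenseBlock] = {
    require(rowsPerBlock > 0 && colsPerBlock > 0, "block sizes must be positive")
    val numRows = if (rows.isEmpty) 0L else rows.map(_.index).max + 1
    val numColBlocks = (numCols + colsPerBlock - 1) / colsPerBlock

    rows.groupBy(r => (r.index / rowsPerBlock).toInt).flatMap { case (blockRow, group) =>
      // The last block row may hold fewer than rowsPerBlock rows.
      val height =
        math.min(rowsPerBlock.toLong, numRows - blockRow.toLong * rowsPerBlock).toInt
      (0 until numColBlocks).map { blockCol =>
        val colStart = blockCol * colsPerBlock
        val width = math.min(colsPerBlock, numCols - colStart)
        // Rows that are absent from the input simply stay zero in the block.
        val data = new Array[Double](height * width)
        for (row <- group; j <- 0 until width) {
          val i = (row.index % rowsPerBlock).toInt
          data(j * height + i) = row.values(colStart + j) // column-major offset
        }
        (blockRow, blockCol) -> DenseBlock(height, width, data)
      }
    }.toMap
  }
}
```

The point of the sketch is that each row is touched once and written into a dense array, so no per-entry coordinate records and no sparse blocks are created along the way.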
[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
[ https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943873#comment-15943873 ] Seth Hendrickson commented on SPARK-20083: -- Yes, that would be the intention. When we implement this change, we have to take care to update any existing code that requires a new array from {{toArray}}. > Change matrix toArray to not create a new array when matrix is already column > major > --- > > Key: SPARK-20083 > URL: https://issues.apache.org/jira/browse/SPARK-20083 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > {{toArray}} always creates a new array in column major format, even when the > resulting array is the same as the backing values. We should change this to > just return a reference to the values array when it is already column major.
[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
[ https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943857#comment-15943857 ] yuhao yang commented on SPARK-20083: So the result array will allow users to manipulate the matrix values. Is it intentional? > Change matrix toArray to not create a new array when matrix is already column > major > --- > > Key: SPARK-20083 > URL: https://issues.apache.org/jira/browse/SPARK-20083 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > {{toArray}} always creates a new array in column major format, even when the > resulting array is the same as the backing values. We should change this to > just return a reference to the values array when it is already column major. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
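The aliasing question raised in the SPARK-20083 thread can be made concrete with a small plain-Scala sketch. `SimpleDenseMatrix`, `toArrayCopy` and `toArrayShared` are hypothetical names used only for illustration; they are not Spark's `DenseMatrix` API. The sketch assumes that a transposed matrix keeps its values in row-major order and a non-transposed one in column-major order.

```scala
// Plain-Scala sketch, no Spark dependency. Hypothetical stand-in for a dense matrix.
object ToArrayAliasing {
  final class SimpleDenseMatrix(
      val numRows: Int,
      val numCols: Int,
      val values: Array[Double],
      val isTransposed: Boolean = false) {

    /** Behaviour described in the ticket: always allocate a fresh column-major array. */
    def toArrayCopy: Array[Double] = {
      val out = new Array[Double](numRows * numCols)
      for (i <- 0 until numRows; j <- 0 until numCols) {
        // Backing storage is row-major when transposed, column-major otherwise.
        val src = if (isTransposed) i * numCols + j else j * numRows + i
        out(j * numRows + i) = values(src)
      }
      out
    }

    /** Proposed behaviour: return the backing array itself when it is already column major. */
    def toArrayShared: Array[Double] =
      if (isTransposed) toArrayCopy else values
  }
}
```

With `toArrayCopy`, writing to the result leaves the matrix untouched. With `toArrayShared` on a column-major matrix, the result is the same object as `values`, so a write through it changes the matrix. Any caller that mutates the returned array would therefore need an explicit copy, which is the care the thread refers to.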
[jira] [Commented] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website
[ https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943827#comment-15943827 ] Cody Koeninger commented on SPARK-19904: It has been added to apache/spark-website git repo http://spark.apache.org/improvement-proposals.html , with a link under the Community menu item. It has not been added to apache/spark git repo http://spark.apache.org/docs/latest/ under the More menu item. I'm not 100% clear on what the plan was for maintaining some of the website docs in a separate site from the main repo, and whether it's worth updating both for cross-linking. > SPIP Add Spark Project Improvement Proposal doc to website > -- > > Key: SPARK-19904 > URL: https://issues.apache.org/jira/browse/SPARK-19904 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Cody Koeninger >Assignee: Cody Koeninger > Labels: SPIP > > see > http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website
[ https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943799#comment-15943799 ] Thomas Graves commented on SPARK-19904: --- Is this done or what is this waiting on? > SPIP Add Spark Project Improvement Proposal doc to website > -- > > Key: SPARK-19904 > URL: https://issues.apache.org/jira/browse/SPARK-19904 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Cody Koeninger >Assignee: Cody Koeninger > Labels: SPIP > > see > http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20087: Assignee: Apache Spark > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis >Assignee: Apache Spark > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943755#comment-15943755 ] Apache Spark commented on SPARK-20087: -- User 'noodle-fb' has created a pull request for this issue: https://github.com/apache/spark/pull/17422 > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20087: Assignee: (was: Apache Spark) > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
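The aggregation argument in SPARK-20087 can be sketched in plain Scala: if killed tasks arrive without metrics, any per-application total silently undercounts duplicated work. The types below (`EndReason`, `TaskEnd`) and the single `runTimeMs` metric are simplified, hypothetical stand-ins and not Spark's listener classes.

```scala
// Plain-Scala sketch, no Spark dependency. Simplified, hypothetical event model.
object TaskCostAggregation {
  sealed trait EndReason
  case object Success extends EndReason
  case object ExceptionFailure extends EndReason
  case object TaskKilled extends EndReason

  // runTimeMs is None when no metrics were delivered with the task-end event.
  final case class TaskEnd(taskId: Long, reason: EndReason, runTimeMs: Option[Long])

  /** Total run time per end reason, counting only events that carried metrics. */
  def runTimeByReason(events: Seq[TaskEnd]): Map[EndReason, Long] =
    events.groupBy(_.reason).map { case (reason, es) =>
      reason -> es.flatMap(_.runTimeMs).sum
    }

  /** Fraction of all observed run time spent on tasks that did not succeed. */
  def wastedFraction(events: Seq[TaskEnd]): Double = {
    val byReason = runTimeByReason(events)
    val total = byReason.values.sum
    if (total == 0L) 0.0
    else (total - byReason.getOrElse(Success, 0L)).toDouble / total
  }
}
```

For one successful task of 100 ms and one killed speculative copy of 50 ms, the wasted fraction is one third when the killed task reports its metrics and 0.0 when it does not, which is the inaccuracy the ticket describes.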
[jira] [Resolved] (SPARK-20105) Add tests for checkType and type string in structField in R
[ https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20105. -- Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Add tests for checkType and type string in structField in R > --- > > Key: SPARK-20105 > URL: https://issues.apache.org/jira/browse/SPARK-20105 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > It seems {{checkType}} and the type string in {{structField}} are not being > tested closely. > This string format currently seems R-specific. Therefore, it seems nicer if > we test positive/negative cases in R side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
[ https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-20102. Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Fixed for 2.1.1 and master. > Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds > > > Key: SPARK-20102 > URL: https://issues.apache.org/jira/browse/SPARK-20102 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.1, 2.2.0 > > > The master snapshot publisher builds are currently broken due to two minor > build issues: > 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors > when the remote FTP directory already exists. To work around this, we should > update the script to ignore errors. > 2. The PySpark setup.py file references a non-existent module, causing Python > packaging to fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943654#comment-15943654 ] Daniel Nuriyev commented on SPARK-20037: Thank you for your feedback. This problem started when I upgraded the Kafka client jars. But since you can't reproduce it, I'll dig in myself. > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, offsets.png > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read a topic starting at given offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000.
So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943652#comment-15943652 ] Sean Owen commented on SPARK-20037: --- I cannot reproduce this in my application. There is more to your code you don't show and you have two different reports here. You may have a real problem; I'm just saying this is not sufficient as a JIRA in this project, and I would not expect someone to investigate. > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, offsets.png > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read a topic starting at given offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000.
So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Nuriyev updated SPARK-20037: --- Attachment: Main.java > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, offsets.png > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read a topic starting at given offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000. So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943642#comment-15943642 ] Daniel Nuriyev commented on SPARK-20037: My system is absolutely simple: a topic whose offset starts at X. A single Java method that opens a streaming context and reads the topic starting with an existing offset. The only dependencies are listed above. I do not think that this is a Spark problem. I think it is a problem in one of the Kafka jars. I will attach the Java method. Have you tried reproducing it? For me it's consistent. > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: offsets.png > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read a topic starting at given offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000.
So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-20088: - Assignee: Hossein Falaki > Do not create new SparkContext in SparkR createSparkContext > --- > > Key: SPARK-20088 > URL: https://issues.apache.org/jira/browse/SPARK-20088 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Fix For: 2.2.0 > > > In the implementation of {{createSparkContext}}, we are calling > {code} > new JavaSparkContext() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-20088. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17423 [https://github.com/apache/spark/pull/17423] > Do not create new SparkContext in SparkR createSparkContext > --- > > Key: SPARK-20088 > URL: https://issues.apache.org/jira/browse/SPARK-20088 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Fix For: 2.2.0 > > > In the implementation of {{createSparkContext}}, we are calling > {code} > new JavaSparkContext() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20104. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17438 [https://github.com/apache/spark/pull/17438] > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > Fix For: 2.2.0 > > > At the current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operators can't estimate `nullCount` accurately. > For example, left outer join estimation does not currently update `nullCount` > accurately. So we only estimate IsNull and IsNotNull predicates when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20104: --- Assignee: Zhenhua Wang > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > At the current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operators can't estimate `nullCount` accurately. > For example, left outer join estimation does not currently update `nullCount` > accurately. So we only estimate IsNull and IsNotNull predicates when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943460#comment-15943460 ] Shubham Chopra commented on SPARK-19803: The PR enforces a refresh of the peer list cached at the executor that is trying to proactively replicate the block. This fix ensures that the peer will never try to replicate to a previously failed executor due to a stale reference. In addition, in the unit test, the block managers are explicitly stopped when they are being removed from the master. > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite have made the > Jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943398#comment-15943398 ] Apache Spark commented on SPARK-20107: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17442 > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. 
[jira] [Assigned] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20107: Assignee: (was: Apache Spark) > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20107: Assignee: Apache Spark > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20107: Summary: Speed up HadoopMapReduceCommitProtocol#commitJob for many output files (was: Speed up FileOutputCommitter#commitJob for many output files) > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. 
[jira] [Updated] (SPARK-20107) Speed up FileOutputCommitter#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20107: Description: Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for many output files. It can speed up {{11 minutes}} for 216869 output files: {code:sql} CREATE TABLE tmp.spark_20107 AS SELECT category_id, product_id, track_id, concat( substr(ds, 3, 2), substr(ds, 6, 2), substr(ds, 9, 2) ) shortDate, CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 'invalid actio' END AS type FROM tmp.user_action WHERE ds > date_sub('2017-01-23', 730) AND actiontype IN ('0','1','2','3'); {code} {code} $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l 216870 {code} This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher versions(see: [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] and [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) and apache's hadoop 2.7.0 higher versions. was: It can speed up {{11 minutes}} for 216869 output files. This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher versions,(see: https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433 and https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0) and apache's hadoop 2.7.0 higher versions. 
> Speed up FileOutputCommitter#commitJob for many output files > > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
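Editor's note on SPARK-20107: the Hadoop setting named in the description can be handed to Spark through the generic spark.hadoop.* prefix, which Spark copies into the Hadoop configuration it builds for a job. The fragment below is only a sketch of that approach and assumes the property is placed in spark-defaults.conf; it is not taken from the linked pull request, and whether version 2 is available depends on the Hadoop build in use, as the description notes.

```properties
# spark-defaults.conf (assumed location; the same key/value pair can also be passed with --conf)
# Select version 2 of the FileOutputCommitter algorithm for jobs that write many output files.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```

With version 2, task output is moved to the final location at task commit rather than at job commit, which is where the time saving for many files comes from. The trade-off is that a job that fails partway can leave partial output visible in the destination directory.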
[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943382#comment-15943382 ] Michel Lemay commented on SPARK-16938: -- I just stumbled upon a similar issue with 'union' on two dataframes.. Anybody still working on this issue and if not, could it be revived? > Cannot resolve column name after a join > --- > > Key: SPARK-16938 > URL: https://issues.apache.org/jira/browse/SPARK-16938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > Found a change of behavior on spark-2.0.0, which breaks a query in our code > base. > The following works on previous spark versions, 1.6.1 up to 2.0.0-preview : > {code} > val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa") > val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb") > dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", > "dfb.id")) > {code} > but fails with spark-2.0.0 with the exception : > {code} > Cannot resolve column name "dfa.id" among (id, a, id, b); > org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" > among (id, a, id, b); > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at 
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20108) Spark query is getting failed with exception
ZS EDGE created SPARK-20108:
---

Summary: Spark query is getting failed with exception
Key: SPARK-20108
URL: https://issues.apache.org/jira/browse/SPARK-20108
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Reporter: ZS EDGE

In our project we have implemented logic that programmatically generates Spark queries. These queries are executed as a subquery; a sample query is below:

sqlContext.sql("INSERT INTO TABLE test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))")

Executing the above code in pyspark throws the following exception:

> .spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
    at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more
[Stage 32:=> (10 + 5) / 26]
17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 failed 4 times; aborting job
17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused