[jira] [Created] (SPARK-20121) simplify NullPropagation with NullIntolerant

2017-03-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20121:
---

 Summary: simplify NullPropagation with NullIntolerant
 Key: SPARK-20121
 URL: https://issues.apache.org/jira/browse/SPARK-20121
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
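
For context: {{NullIntolerant}} marks Catalyst expressions that evaluate to null whenever any input is null, which is presumably what allows a single generic {{NullPropagation}} case to replace per-expression handling. A minimal, self-contained sketch of the idea (toy classes, not Spark's actual Catalyst API):

{code:title=NullPropagationSketch.scala|borderStyle=solid}
// Toy model of the optimization idea; these are NOT Spark's Catalyst classes.
object NullPropagationSketch {
  sealed trait Expr { def children: Seq[Expr] }
  case object NullLiteral extends Expr { val children: Seq[Expr] = Nil }
  case class IntLiteral(v: Int) extends Expr { val children: Seq[Expr] = Nil }

  // Marker: the expression returns null whenever any child is null.
  trait NullIntolerant

  case class Abs(child: Expr) extends Expr with NullIntolerant {
    val children: Seq[Expr] = Seq(child)
  }

  // One generic rule replaces many per-expression cases: any NullIntolerant
  // expression with a null literal child folds straight to a null literal.
  def propagateNull(e: Expr): Expr = e match {
    case _: NullIntolerant if e.children.contains(NullLiteral) => NullLiteral
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(propagateNull(Abs(NullLiteral)))    // NullLiteral
    println(propagateNull(Abs(IntLiteral(-3)))) // Abs(IntLiteral(-3))
  }
}
{code}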






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20120) spark-sql CLI support silent mode

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20120:


Assignee: (was: Apache Spark)

> spark-sql CLI support silent mode
> -
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> It is similar to Hive silent mode: only the query result is shown. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli






[jira] [Commented] (SPARK-20120) spark-sql CLI support silent mode

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944549#comment-15944549
 ] 

Apache Spark commented on SPARK-20120:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17449

> spark-sql CLI support silent mode
> -
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> It is similar to Hive silent mode: only the query result is shown. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli






[jira] [Assigned] (SPARK-20120) spark-sql CLI support silent mode

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20120:


Assignee: Apache Spark

> spark-sql CLI support silent mode
> -
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> It is similar to Hive silent mode: only the query result is shown. See:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli






[jira] [Created] (SPARK-20120) spark-sql CLI support silent mode

2017-03-27 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-20120:
---

 Summary: spark-sql CLI support silent mode
 Key: SPARK-20120
 URL: https://issues.apache.org/jira/browse/SPARK-20120
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Yuming Wang


It is similar to Hive silent mode: only the query result is shown. See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli






[jira] [Assigned] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20119:


Assignee: Apache Spark  (was: Xiao Li)

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> 
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Failed in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/






[jira] [Commented] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944535#comment-15944535
 ] 

Apache Spark commented on SPARK-20119:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17448

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> 
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Failed in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/






[jira] [Assigned] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20119:


Assignee: Xiao Li  (was: Apache Spark)

> Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
> 
>
> Key: SPARK-20119
> URL: https://issues.apache.org/jira/browse/SPARK-20119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Failed in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/






[jira] [Created] (SPARK-20119) Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite

2017-03-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20119:
---

 Summary: Flaky Test: org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite
 Key: SPARK-20119
 URL: https://issues.apache.org/jira/browse/SPARK-20119
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


Failed in 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6/2895/testReport/org.apache.spark.sql.execution/DataSourceScanExecRedactionSuite/treeString_is_redacted/






[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944506#comment-15944506
 ] 

Sean Owen commented on SPARK-19476:
---

Why do that in a thread instead of doing that same work synchronously? It 
sounds like you even need to make it synchronous externally.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>Priority: Minor
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.
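
A common workaround (a sketch under the same setup as the reproduction above, not an official fix) is to drain the partition iterator on the thread that received it and only hand the materialized rows to other threads, so the ThreadLocal state inside TungstenAggregationIterator is never touched from a foreign thread:

{code:title=Workaround.scala|borderStyle=solid}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Workaround extends App {
  val sc = new SparkContext("local", "workaround")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

  df.foreachPartition { iterator =>
    // Materialize on the thread that owns the iterator ...
    val rows = iterator.toVector
    // ... then it is safe to process the rows on other threads.
    val f = Future(rows.map(_.getLong(1)).sum)
    Await.result(f, Duration.Inf)
  }
}
{code}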






[jira] [Closed] (SPARK-20118) spark2.1 support numeric datatype

2017-03-27 Thread QQShu1 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

QQShu1 closed SPARK-20118.
--

> spark2.1 support numeric datatype
> -
>
> Key: SPARK-20118
> URL: https://issues.apache.org/jira/browse/SPARK-20118
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: QQShu1
>
> Spark 2.1 does not currently support the NUMERIC data type.






[jira] [Resolved] (SPARK-20118) spark2.1 support numeric datatype

2017-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20118.
---
Resolution: Invalid

> spark2.1 support numeric datatype
> -
>
> Key: SPARK-20118
> URL: https://issues.apache.org/jira/browse/SPARK-20118
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: QQShu1
>
> Spark 2.1 does not currently support the NUMERIC data type.






[jira] [Commented] (SPARK-19963) create view from select Fails when nullif() is used

2017-03-27 Thread dharani_sugumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944461#comment-15944461
 ] 

dharani_sugumar commented on SPARK-19963:
-

@Jay Danielsen: I'm looking into this.

> create view from select Fails when nullif() is used
> ---
>
> Key: SPARK-19963
> URL: https://issues.apache.org/jira/browse/SPARK-19963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Danielsen
>Priority: Minor
>
> Test case: any valid query using nullif.
> SELECT nullif(mycol,0) from mytable;
> Creating a view FAILS when nullif is used in the select.
> CREATE VIEW my_view as
> SELECT nullif(mycol,0) from mytable;
> Error: java.lang.RuntimeException: Failed to analyze the canonicalized SQL: 
> ...
> I can refactor with a CASE statement and create the view successfully.
> CREATE VIEW my_view as
> SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END mycol from mytable;






[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Description: 
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen file are attached (with query plans). It's not a 
deterministic repro, but when running a big query load, I eventually see it come 
up within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
think that means it's not an issue with the code-gen, but I can't figure out what 
the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables. And 
we partition/sort them all beforehand so it's always sort-merge joins or 
broadcast joins (with small tables).

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but 
that is marked as fixed in 2.0.0.

  was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen file are attached (with query plans). Its not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephermal attached drives, I can't 
repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
think that means its not an issue with the code-gen, but I cant figure out what 
the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but 
that is marked fix in 2.0.0


> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but when running a big query load, I eventually see it 
> come up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime 

[jira] [Created] (SPARK-20118) spark2.1 support numeric datatype

2017-03-27 Thread QQShu1 (JIRA)
QQShu1 created SPARK-20118:
--

 Summary: spark2.1 support numeric datatype
 Key: SPARK-20118
 URL: https://issues.apache.org/jira/browse/SPARK-20118
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: QQShu1


Spark 2.1 does not currently support the NUMERIC data type.






[jira] [Closed] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation

2017-03-27 Thread jianran.tfh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jianran.tfh closed SPARK-20117.
---
Resolution: Invalid

> TaskSetManager checkSpeculatableTasks variables immutability and Use string 
> interpolation
> -
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Priority: Trivial
>
> Make the variables in TaskSetManager.checkSpeculatableTasks immutable and use string interpolation.






[jira] [Updated] (SPARK-19757) Executor with task scheduled could be killed due to idleness

2017-03-27 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-19757:
-
Component/s: Scheduler

> Executor with task scheduled could be killed due to idleness
> 
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.0
>Reporter: jin xing
>Assignee: Jimmy Xiang
>Priority: Minor
> Fix For: 2.2.0
>
>
> With dynamic executor allocation enabled in YARN mode, if another job is 
> submitted a while after a previous job finishes, there is a race between 
> killing idle executors and scheduling new tasks on those executors. Sometimes 
> an executor is killed right after a task has been scheduled on it.






[jira] [Commented] (SPARK-20070) Redact datasource explain output

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944417#comment-15944417
 ] 

Apache Spark commented on SPARK-20070:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17448

> Redact datasource explain output
> 
>
> Key: SPARK-20070
> URL: https://issues.apache.org/jira/browse/SPARK-20070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> When calling explain on a datasource, the output can contain sensitive 
> information. We should provide a way for an admin/user to redact such information.






[jira] [Assigned] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20117:


Assignee: Apache Spark

> TaskSetManager checkSpeculatableTasks variables immutability and Use string 
> interpolation
> -
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Assignee: Apache Spark
>Priority: Trivial
>
> Make the variables in TaskSetManager.checkSpeculatableTasks immutable and use string interpolation.






[jira] [Assigned] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20117:


Assignee: (was: Apache Spark)

> TaskSetManager checkSpeculatableTasks variables immutability and Use string 
> interpolation
> -
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Priority: Trivial
>
> Make the variables in TaskSetManager.checkSpeculatableTasks immutable and use string interpolation.






[jira] [Commented] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944410#comment-15944410
 ] 

Apache Spark commented on SPARK-20117:
--

User 'jianran' has created a pull request for this issue:
https://github.com/apache/spark/pull/17447

> TaskSetManager checkSpeculatableTasks variables immutability and Use string 
> interpolation
> -
>
> Key: SPARK-20117
> URL: https://issues.apache.org/jira/browse/SPARK-20117
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jianran.tfh
>Priority: Trivial
>
> Make the variables in TaskSetManager.checkSpeculatableTasks immutable and use string interpolation.






[jira] [Created] (SPARK-20117) TaskSetManager checkSpeculatableTasks variables immutability and Use string interpolation

2017-03-27 Thread jianran.tfh (JIRA)
jianran.tfh created SPARK-20117:
---

 Summary: TaskSetManager checkSpeculatableTasks variables 
immutability and Use string interpolation
 Key: SPARK-20117
 URL: https://issues.apache.org/jira/browse/SPARK-20117
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: jianran.tfh
Priority: Trivial


Make the variables in TaskSetManager.checkSpeculatableTasks immutable and use string interpolation.
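
For readers unfamiliar with the terminology, the change amounts to preferring {{val}} over {{var}} for values that are never reassigned and Scala string interpolation over string concatenation. An illustrative (hypothetical, not the actual TaskSetManager code) before/after:

{code:title=InterpolationExample.scala|borderStyle=solid}
object InterpolationExample extends App {
  val taskId = 42
  val runtimeMs = 3250L

  // Before: string concatenation assembles the message piece by piece.
  val before = "Marking task " + taskId + " as speculatable because it ran more than " +
    runtimeMs + " ms"

  // After: string interpolation is shorter and easier to read.
  val after = s"Marking task $taskId as speculatable because it ran more than $runtimeMs ms"

  println(before == after) // true
}
{code}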






[jira] [Resolved] (SPARK-19088) Optimize sequence type deserialization codegen

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19088.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16541
[https://github.com/apache/spark/pull/16541]

> Optimize sequence type deserialization codegen
> --
>
> Key: SPARK-19088
> URL: https://issues.apache.org/jira/browse/SPARK-19088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michal Šenkýř
>Assignee: Michal Šenkýř
>Priority: Minor
>  Labels: performance
> Fix For: 2.2.0
>
>
> Sequence type deserialization codegen added in [PR 
> #16240|https://github.com/apache/spark/pull/16240] should use a proper 
> builder instead of a conversion (using {{to}}) to avoid an additional pass.
> This will require an additional {{MapObjects}}-like operation that will use 
> the provided builder instead of building an array.
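
To illustrate the difference outside of codegen (a standalone sketch, not the generated code itself): a conversion with {{to}} first materializes one collection and then converts it, while appending into a {{Builder}} for the target type produces the result in a single pass.

{code:title=BuilderSketch.scala|borderStyle=solid}
import scala.collection.mutable

object BuilderSketch extends App {
  val input: Array[Int] = Array(1, 2, 3, 4)

  // Conversion style (Scala 2.11/2.12 syntax): build a Seq, then convert it.
  val viaConversion: List[Int] = input.toSeq.to[List] // extra pass over the data

  // Builder style: append elements straight into the target collection type.
  val builder: mutable.Builder[Int, List[Int]] = List.newBuilder[Int]
  builder.sizeHint(input.length)
  input.foreach(builder += _)
  val viaBuilder: List[Int] = builder.result()

  println(viaConversion == viaBuilder) // true
}
{code}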






[jira] [Assigned] (SPARK-19088) Optimize sequence type deserialization codegen

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19088:
---

Assignee: Michal Šenkýř

> Optimize sequence type deserialization codegen
> --
>
> Key: SPARK-19088
> URL: https://issues.apache.org/jira/browse/SPARK-19088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michal Šenkýř
>Assignee: Michal Šenkýř
>Priority: Minor
>  Labels: performance
>
> Sequence type deserialization codegen added in [PR 
> #16240|https://github.com/apache/spark/pull/16240] should use a proper 
> builder instead of a conversion (using {{to}}) to avoid an additional pass.
> This will require an additional {{MapObjects}}-like operation that will use 
> the provided builder instead of building an array.






[jira] [Resolved] (SPARK-20100) Consolidate SessionState construction

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20100.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17433
[https://github.com/apache/spark/pull/17433]

> Consolidate SessionState construction
> -
>
> Key: SPARK-20100
> URL: https://issues.apache.org/jira/browse/SPARK-20100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> The current SessionState initialization path is quite complex. A part of the 
> creation is done in the SessionState companion objects, a part of the 
> creation is done inside the SessionState class, and a part is done by passing 
> functions.
> The proposal is to consolidate the SessionState initialization into a builder 
> class. This SessionState will not do any initialization and just becomes a 
> placeholder for the various Spark SQL internals. The advantages of this 
> approach are the following:
> - SessionState initialization is less dispersed. The builder should be a one 
> stop shop.
> - This provides us with a start for removing the HiveSessionState. Removing 
> the hive session state would also require us to move resource loading into a 
> separate class, and to (re)move metadata hive.
> - It is easier to customize the Spark Session. You just need to create a 
> custom version of the builder. I will add hooks to make this easier. Opening 
> up these APIs will happen at a later point.
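
As a rough illustration of the builder approach described above (toy types and fields, not the actual SessionState internals): all construction and customization happens in one builder, and the resulting state object is a plain holder.

{code:title=SessionStateBuilderSketch.scala|borderStyle=solid}
// Sketch only: the real SessionState holds analyzers, optimizers, catalogs, etc.
final case class SessionStateSketch(catalog: String,
                                    extraOptimizerRules: Seq[String],
                                    confOverrides: Map[String, String])

class SessionStateSketchBuilder {
  private var catalog: String = "in-memory"
  private var extraOptimizerRules: Seq[String] = Nil
  private var confOverrides: Map[String, String] = Map.empty

  // Customization lives here, in one place, instead of being spread across
  // companion objects, the state class itself, and passed-in functions.
  def withCatalog(name: String): this.type = { catalog = name; this }
  def withExtraRule(rule: String): this.type = { extraOptimizerRules :+= rule; this }
  def withConf(key: String, value: String): this.type = { confOverrides += (key -> value); this }

  def build(): SessionStateSketch =
    SessionStateSketch(catalog, extraOptimizerRules, confOverrides)
}

object SessionStateBuilderDemo extends App {
  val state = new SessionStateSketchBuilder()
    .withCatalog("hive")
    .withExtraRule("MyCustomRule")
    .withConf("spark.sql.shuffle.partitions", "400")
    .build()
  println(state)
}
{code}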






[jira] [Created] (SPARK-20116) Remove task-level functionality from the DAGScheduler

2017-03-27 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-20116:
--

 Summary: Remove task-level functionality from the DAGScheduler
 Key: SPARK-20116
 URL: https://issues.apache.org/jira/browse/SPARK-20116
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


Long, long ago, the scheduler code was more modular, and the DAGScheduler 
handled the logic of scheduling DAGs of stages (as the name suggests) and the 
TaskSchedulerImpl handled scheduling the tasks within a stage.  Over time, more 
and more task-specific functionality has been added to the DAGScheduler, and 
now, the DAGScheduler duplicates a bunch of the task tracking that's done by 
other scheduler components.  This makes the scheduler code harder to reason 
about, and has led to some tricky bugs (e.g., SPARK-19263).  We should move all 
of this functionality back to the TaskSchedulerImpl and TaskSetManager, which 
should "hide" that complexity from the DAGScheduler.






[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19803.
-
Resolution: Fixed

Issue resolved by pull request 17325
[https://github.com/apache/spark/pull/17325]

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite has made the 
> jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/






[jira] [Commented] (SPARK-17075) Cardinality Estimation of Predicate Expressions

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944325#comment-15944325
 ] 

Apache Spark commented on SPARK-17075:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17446

> Cardinality Estimation of Predicate Expressions
> ---
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Ron Hu
> Fix For: 2.2.0
>
>
> A filter condition is the predicate expression specified in the WHERE clause 
> of a SQL select statement.  A predicate can be a compound logical expression 
> with logical AND, OR, NOT operators combining multiple single conditions.  A 
> single condition usually has comparison operators such as =, <, <=, >, >=, 
> ‘like’, etc.
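
As a rough sketch of how per-condition estimates are usually combined (standard textbook formulas under an independence assumption, not necessarily the exact formulas implemented in the pull request):

{code:title=SelectivitySketch.scala|borderStyle=solid}
object SelectivitySketch extends App {
  // Selectivity = estimated fraction of rows satisfying a predicate, in [0, 1].
  sealed trait Pred
  case class Single(sel: Double) extends Pred // e.g. a = 5, b < 100, c LIKE 'x%'
  case class And(l: Pred, r: Pred) extends Pred
  case class Or(l: Pred, r: Pred) extends Pred
  case class Not(p: Pred) extends Pred

  def estimate(p: Pred): Double = p match {
    case Single(s)  => s
    case And(l, r)  => estimate(l) * estimate(r)                             // independence
    case Or(l, r)   => estimate(l) + estimate(r) - estimate(l) * estimate(r)
    case Not(inner) => 1.0 - estimate(inner)
  }

  // Example: (a = 5 AND b < 100) OR NOT (c LIKE 'x%'),
  // with per-condition selectivities 0.01 (say 1/ndv), 0.3 and 0.05.
  val pred = Or(And(Single(0.01), Single(0.3)), Not(Single(0.05)))
  println(estimate(pred)) // ≈ 0.95015
}
{code}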






[jira] [Assigned] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20115:


Assignee: Apache Spark

> Fix DAGScheduler to recompute all the lost shuffle blocks when external 
> shuffle service is unavailable
> --
>
> Key: SPARK-20115
> URL: https://issues.apache.org/jira/browse/SPARK-20115
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
> Environment: Spark on Yarn with external shuffle service enabled, 
> running on AWS EMR cluster.
>Reporter: Udit Mehrotra
>Assignee: Apache Spark
>
> Spark's DAGScheduler currently does not recompute all the lost shuffle 
> blocks on a host when a FetchFailed exception occurs, while fetching shuffle 
> blocks from another executor with external shuffle service enabled. Instead 
> it only recomputes the lost shuffle blocks computed by the executor for which 
> the FetchFailed exception occurred. This works fine for Internal shuffle 
> scenario, where the executors serve their own shuffle blocks and hence only 
> the shuffle blocks for that executor should be considered lost. However, when 
> External Shuffle Service is being used, a FetchFailed exception would mean 
> that the external shuffle service running on that host has become 
> unavailable. This in turn is sufficient to assume that all the shuffle blocks 
> which were managed by the Shuffle service on that host are lost. Therefore, 
> just recomputing the shuffle blocks associated with the particular Executor 
> for which FetchFailed exception occurred is not sufficient. We need to 
> recompute all the shuffle blocks, managed by that service because there could 
> be multiple executors running on that host.
>  
> Since not all the shuffle blocks (for all the executors on the host) are 
> recomputed, this causes future attempts of the reduce stage to fail as well 
> because the new tasks scheduled still keep trying to reach the old location 
> of the shuffle blocks (which were not recomputed) and keep throwing further 
> FetchFailed exceptions. This ultimately causes the job to fail, after the 
> reduce stage has been retried 4 times.






[jira] [Commented] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944312#comment-15944312
 ] 

Apache Spark commented on SPARK-20115:
--

User 'umehrot2' has created a pull request for this issue:
https://github.com/apache/spark/pull/17445

> Fix DAGScheduler to recompute all the lost shuffle blocks when external 
> shuffle service is unavailable
> --
>
> Key: SPARK-20115
> URL: https://issues.apache.org/jira/browse/SPARK-20115
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
> Environment: Spark on Yarn with external shuffle service enabled, 
> running on AWS EMR cluster.
>Reporter: Udit Mehrotra
>
> Spark's DAGScheduler currently does not recompute all the lost shuffle 
> blocks on a host when a FetchFailed exception occurs, while fetching shuffle 
> blocks from another executor with external shuffle service enabled. Instead 
> it only recomputes the lost shuffle blocks computed by the executor for which 
> the FetchFailed exception occurred. This works fine for Internal shuffle 
> scenario, where the executors serve their own shuffle blocks and hence only 
> the shuffle blocks for that executor should be considered lost. However, when 
> External Shuffle Service is being used, a FetchFailed exception would mean 
> that the external shuffle service running on that host has become 
> unavailable. This in turn is sufficient to assume that all the shuffle blocks 
> which were managed by the Shuffle service on that host are lost. Therefore, 
> just recomputing the shuffle blocks associated with the particular Executor 
> for which FetchFailed exception occurred is not sufficient. We need to 
> recompute all the shuffle blocks, managed by that service because there could 
> be multiple executors running on that host.
>  
> Since not all the shuffle blocks (for all the executors on the host) are 
> recomputed, this causes future attempts of the reduce stage to fail as well 
> because the new tasks scheduled still keep trying to reach the old location 
> of the shuffle blocks (which were not recomputed) and keep throwing further 
> FetchFailed exceptions. This ultimately causes the job to fail, after the 
> reduce stage has been retried 4 times.






[jira] [Assigned] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20115:


Assignee: (was: Apache Spark)

> Fix DAGScheduler to recompute all the lost shuffle blocks when external 
> shuffle service is unavailable
> --
>
> Key: SPARK-20115
> URL: https://issues.apache.org/jira/browse/SPARK-20115
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
> Environment: Spark on Yarn with external shuffle service enabled, 
> running on AWS EMR cluster.
>Reporter: Udit Mehrotra
>
> Spark's DAGScheduler currently does not recompute all the lost shuffle 
> blocks on a host when a FetchFailed exception occurs, while fetching shuffle 
> blocks from another executor with external shuffle service enabled. Instead 
> it only recomputes the lost shuffle blocks computed by the executor for which 
> the FetchFailed exception occurred. This works fine for Internal shuffle 
> scenario, where the executors serve their own shuffle blocks and hence only 
> the shuffle blocks for that executor should be considered lost. However, when 
> External Shuffle Service is being used, a FetchFailed exception would mean 
> that the external shuffle service running on that host has become 
> unavailable. This in turn is sufficient to assume that all the shuffle blocks 
> which were managed by the Shuffle service on that host are lost. Therefore, 
> just recomputing the shuffle blocks associated with the particular Executor 
> for which FetchFailed exception occurred is not sufficient. We need to 
> recompute all the shuffle blocks, managed by that service because there could 
> be multiple executors running on that host.
>  
> Since not all the shuffle blocks (for all the executors on the host) are 
> recomputed, this causes future attempts of the reduce stage to fail as well 
> because the new tasks scheduled still keep trying to reach the old location 
> of the shuffle blocks (which were not recomputed) and keep throwing further 
> FetchFailed exceptions. This ultimately causes the job to fail, after the 
> reduce stage has been retried 4 times.






[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Description: 
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen file are attached (with query plans). It's not a 
deterministic repro, but when running a big query load, I eventually see it come 
up within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
think that means it's not an issue with the code-gen, but I can't figure out what 
the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but 
that is marked as fixed in 2.0.0.

  was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen file are attached (with query plans). Its not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephermal attached drives, I can't 
repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
think that means its not an issue with the code-gen, but I cant figure out what 
the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}


> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but when running a big query load, I eventually see it 
> come up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 

[jira] [Issue Comment Deleted] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Comment: was deleted

(was: This kind of looks like 
https://issues.apache.org/jira/browse/SPARK-15822, but that is marked fix in 
2.0.0)

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but when running a big query load, I eventually see it 
> come up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}






[jira] [Created] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-03-27 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-20115:
-

 Summary: Fix DAGScheduler to recompute all the lost shuffle blocks 
when external shuffle service is unavailable
 Key: SPARK-20115
 URL: https://issues.apache.org/jira/browse/SPARK-20115
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core, YARN
Affects Versions: 2.1.0, 2.0.2
 Environment: Spark on Yarn with external shuffle service enabled, 
running on AWS EMR cluster.
Reporter: Udit Mehrotra


Spark's DAGScheduler currently does not recompute all the lost shuffle 
blocks on a host when a FetchFailed exception occurs, while fetching shuffle 
blocks from another executor with external shuffle service enabled. Instead it 
only recomputes the lost shuffle blocks computed by the executor for which the 
FetchFailed exception occurred. This works fine for Internal shuffle scenario, 
where the executors serve their own shuffle blocks and hence only the shuffle 
blocks for that executor should be considered lost. However, when External 
Shuffle Service is being used, a FetchFailed exception would mean that the 
external shuffle service running on that host has become unavailable. This in 
turn is sufficient to assume that all the shuffle blocks which were managed by 
the Shuffle service on that host are lost. Therefore, just recomputing the 
shuffle blocks associated with the particular Executor for which FetchFailed 
exception occurred is not sufficient. We need to recompute all the shuffle 
blocks, managed by that service because there could be multiple executors 
running on that host.
 
Since not all the shuffle blocks (for all the executors on the host) are 
recomputed, this causes future attempts of the reduce stage to fail as well 
because the new tasks scheduled still keep trying to reach the old location of 
the shuffle blocks (which were not recomputed) and keep throwing further 
FetchFailed exceptions. This ultimately causes the job to fail, after the 
reduce stage has been retried 4 times.
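
A minimal sketch of the bookkeeping change being described (toy data structures only; the real fix belongs in the DAGScheduler/MapOutputTracker, whose APIs are not reproduced here): with an external shuffle service, a fetch failure should invalidate the map outputs of every executor on the failed host, not only the executor named in the failure.

{code:title=LostShuffleBlocksSketch.scala|borderStyle=solid}
object LostShuffleBlocksSketch extends App {
  // (host, executorId) -> ids of map outputs served from that executor
  type Outputs = Map[(String, String), Set[Int]]

  // Current behavior (simplified): only the failing executor's outputs are dropped.
  def removeExecutorOutputs(outputs: Outputs, host: String, execId: String): Outputs =
    outputs - ((host, execId))

  // Behavior described above for the external shuffle service case: the service
  // on the host is gone, so outputs of *all* executors on that host are lost.
  def removeHostOutputs(outputs: Outputs, host: String): Outputs =
    outputs.filterNot { case ((h, _), _) => h == host }

  val outputs: Outputs = Map(
    ("host-1", "exec-1") -> Set(0, 1),
    ("host-1", "exec-2") -> Set(2, 3),
    ("host-2", "exec-3") -> Set(4))

  println(removeExecutorOutputs(outputs, "host-1", "exec-1").keySet) // exec-2's stale outputs remain
  println(removeHostOutputs(outputs, "host-1").keySet)               // only host-2's outputs remain
}
{code}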






[jira] [Comment Edited] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239
 ] 

yuhao yang edited comment on SPARK-20114 at 3/27/17 11:42 PM:
--

Currently I prefer to implement the dummy PrefixSpanModel as the sequential 
rules extracted won't be quite useful. And if needed, we can implement other 
algorithms to extract sequential rules for prediction.


was (Author: yuhaoyan):
Currently I prefer to implement the dummy PrefixSpanModel as the sequential 
rules extracted won't be quite useful. 

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 






[jira] [Commented] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239
 ] 

yuhao yang commented on SPARK-20114:


Currently I prefer to implement the dummy PrefixSpanModel as the sequential 
rules extracted won't be quite useful. 

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 






[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20114:
---
Description: 
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, a few design issues to be discussed are listed below; subtasks for the 
Scala, Python and R APIs will then be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited for direct prediction on new records. Please read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. The PrefixSpanModel is only used to provide 
access to the frequent sequential patterns.
 #*  Add the feature to extract sequential rules from the sequential 
patterns, then use those rules in transform() the way FPGrowthModel does. The 
rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
In practice, however, these rules are not very good: they are too precise and 
thus not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can be 
extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. These rules are X -> Y, where X and Y are each 
unordered but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 


  was:
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, a few design issues to be discussed are listed below; subtasks for the 
Scala, Python and R APIs will then be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited for direct prediction on new records. Please read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from the sequential 
patterns, then use those rules in transform() the way FPGrowthModel does. The 
rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
In practice, however, these rules are not very good: they are too precise and 
thus not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can be 
extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. These rules are X -> Y, where X and Y are each 
unordered but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 



> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, a few design issues to be discussed are listed below; subtasks for the 
> Scala, Python and R APIs will then be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited for direct prediction on new records. Please read 
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access to the frequent sequential patterns.
>  #*  Add the feature to extract sequential rules from the sequential 
> patterns, then use those rules in transform() the way FPGrowthModel does.  

[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20114:
---
Description: 
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, a few design issues to be discussed are listed below; subtasks for the 
Scala, Python and R APIs will then be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited for direct prediction on new records. Please read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from the sequential 
patterns, then use those rules in transform() the way FPGrowthModel does. The 
rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
In practice, however, these rules are not very good: they are too precise and 
thus not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can be 
extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. These rules are X -> Y, where X and Y are each 
unordered but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 


  was:
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, a few design issues to be discussed are listed below; subtasks for 
Scala, Python and R will then be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited for direct prediction on new records. Please read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from the sequential 
patterns, then use those rules in transform() the way FPGrowthModel does. The 
rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
In practice, however, these rules are not very good: they are too precise and 
thus not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can be 
extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. These rules are X -> Y, where X and Y are each 
unordered but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 



> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, a few design issues to be discussed are listed below; subtasks for the 
> Scala, Python and R APIs will then be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited for direct prediction on new records. Please read 
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. 
>  #*  Add the feature to extract sequential rules from the sequential 
> patterns, then use those rules in transform() the way FPGrowthModel does. The 
> rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
> In practice, however, these rules are not very good: they are too precise and 
> thus not noise tolerant.

[jira] [Created] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20114:
--

 Summary: spark.ml parity for sequential pattern mining - PrefixSpan
 Key: SPARK-20114
 URL: https://issues.apache.org/jira/browse/SPARK-20114
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, a few design issues to be discussed are listed below; subtasks for 
Scala, Python and R will then be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited for direct prediction on new records. Please read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from the sequential 
patterns, then use those rules in transform() the way FPGrowthModel does. The 
rules extracted are of the form X -> Y, where X and Y are sequential patterns. 
In practice, however, these rules are not very good: they are too precise and 
thus not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can be 
extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. These rules are X -> Y, where X and Y are each 
unordered but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944215#comment-15944215
 ] 

Timothy Hunter commented on SPARK-19634:


[~sethah], yes, thanks for bringing up these concerns. Regarding the first 
points, the UDAF interface does not let you update arrays in place, which is a 
non-starter in our case; this is why the implementation switches to 
TypedImperativeAggregate (TIA). I have updated the design doc with these 
comments.

Regarding the performance, I agree that there is a tension between having an 
API that is compatible with structured streaming and the current, RDD-based 
implementation. I will provide some test numbers so that we have a basis for 
discussion. That being said, the RDD API is not going away, so if users care 
about performance and do not need the additional benefit of integrating with 
SQL or structured streaming, they can still use it.
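
For reference, a minimal sketch of the RDD-based path mentioned above, assuming 
an active SparkContext named sc (Statistics.colStats is backed by 
MultivariateOnlineSummarizer):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy data; each Vector is one observation with two columns.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

val summary = Statistics.colStats(observations)
println(summary.mean)        // column-wise means
println(summary.variance)    // column-wise variances
println(summary.numNonzeros) // column-wise non-zero counts
{code}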

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19612) Tests failing with timeout

2017-03-27 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944181#comment-15944181
 ] 

Kay Ousterhout commented on SPARK-19612:


And another: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75272/console

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-27 Thread Lucy Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944163#comment-15944163
 ] 

Lucy Yu commented on SPARK-19476:
-

bq. I don't think in general you're expected to be able to do this safely. Why 
would you do this asynchronously or with more partitions, simply?

Sorry, my simplified example had a mistake in it, but a user ran into a 
NullPointerException in our actual code. In my simplified example the lambda 
function may return before the thread is complete, but the actual code enforces 
that the lambda function cannot return until the thread has finished, i.e. we have

{code}
df.foreachPartition(partition => {
...
numRowsAccumulator += ingestStrategy.loadPartition(targetNode, partition)
})
{code}

and loadPartition is defined at 
https://github.com/memsql/memsql-spark-connector/blob/master/src/main/scala/com/memsql/spark/connector/LoadDataStrategy.scala#L18
. Basically, the thread finishes once it has read all of the partition's data 
into a stream, at which point it closes the stream; 
stmt.executeUpdate(query.sql.toString), which is part of the function passed to 
foreachPartition, waits until the stream is closed.

We do this to load the partition into a database in a constant-memory way -- by 
writing to a pipe and consuming from it at the same time. Without this approach, 
users may run out of memory materializing the partition.
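
For anyone hitting the same NullPointerException: a hedged sketch (not the 
connector's actual code) of one way to keep the constant-memory, two-thread 
approach while only ever reading the Spark-provided iterator from the task's 
own thread. The queue capacity and the sink are placeholders, and df is the 
aggregated DataFrame from the reproduction below.

{code}
import java.util.concurrent.ArrayBlockingQueue

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.sql.Row

df.foreachPartition { iterator =>
  // Bounded hand-off queue; the capacity is an arbitrary illustrative value.
  val queue = new ArrayBlockingQueue[Option[Row]](1024)

  // Writer thread: only touches the queue, never the Spark iterator, because
  // TungstenAggregationIterator relies on thread-local state.
  val writer = Future {
    Iterator.continually(queue.take()).takeWhile(_.isDefined).flatten.foreach { row =>
      println(row) // placeholder: write the row to the real sink / pipe here
    }
  }

  // Task thread: the only consumer of the Spark iterator.
  iterator.foreach(row => queue.put(Some(row)))
  queue.put(None) // poison pill to stop the writer

  Await.result(writer, Duration.Inf)
}
{code}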

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>Priority: Minor
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944030#comment-15944030
 ] 

Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM:


I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces - an RDD one that provides the same performance as current 
{{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? I'd prefer to get the 
details hashed out further rather than rushing to provide an API and an 
initial slow implementation, so that we can make sure we get this correct in 
the long term. I'd really appreciate some clarification, and my apologies if I 
have missed any of the details/discussion.


was (Author: sethah):
I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces - an RDD one that provides the same performance as current 
{{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? If this is still targeted 
at 2.2, why? I'd prefer to get the details hashed out further rather than 
rushing to provide an API and an initial slow implementation, so that we can 
make sure we get this correct in the long term. I'd really appreciate some 
clarification, and my apologies if I have missed any of the details/discussion.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20111.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
> Fix For: 2.1.1, 2.2.0
>
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944113#comment-15944113
 ] 

Joseph K. Bradley edited comment on SPARK-20111 at 3/27/17 9:59 PM:


Yep, you're right.  Closing now.  Thanks!


was (Author: josephkb):
Yep, you're right.  Is the fix worth backporting, or would it be too much 
trouble?

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
> Fix For: 2.1.1, 2.2.0
>
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944113#comment-15944113
 ] 

Joseph K. Bradley commented on SPARK-20111:
---

Yep, you're right.  Is the fix worth backporting, or would it be too much 
trouble?

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944100#comment-15944100
 ] 

Mitesh commented on SPARK-20112:


This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, but 
that is marked fixed in 2.0.0.

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: codegen_sorter_crash.log

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-03-27 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944094#comment-15944094
 ] 

Morten Hornbech commented on SPARK-14560:
-

No. We worried it might just trigger some other bad behaviour, and it wasn't a 
production issue.

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}}s used by the old RDD API still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  E.g., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However, that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.
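
For completeness, applying that workaround looks roughly like this; the 
threshold below is only an illustrative starting point and, as noted above, the 
property is undocumented and must be tuned per workload:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Force Spillables to spill every N elements instead of letting the
// shuffle-read side grab all available memory.
val conf = new SparkConf()
  .setAppName("force-spill-workaround")
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")

val sc = new SparkContext(conf)
{code}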



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: (was: codegen_sorter_crash.log)

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-03-27 Thread Darshan Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944090#comment-15944090
 ] 

Darshan Mehta commented on SPARK-14560:
---

[~mhornbech] thanks for the prompt response. Before rewriting the job, did you 
see any difference by playing around with 
spark.shuffle.spill.numElementsForceSpillThreshold=N  values?

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}}s used by the old RDD API still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  E.g., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However, that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944089#comment-15944089
 ] 

Herman van Hovell commented on SPARK-20111:
---

[~josephkb] this might already be fixed in the latest master/2.1 (see 
SPARK-19512). Did you try this on the latest master?

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key

2017-03-27 Thread Bhanu Akaveeti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944088#comment-15944088
 ] 

Bhanu Akaveeti commented on SPARK-20113:


I was expecting an entire-row comparison, but is there a provision to specify a 
key (for comparison) in the script in future releases (later than 2.0.1)?

Also, truncate seems not to be working on a table that does not have a PK.

> overwrite mode appends data on MySQL table that does not have a primary key
> ---
>
> Key: SPARK-20113
> URL: https://issues.apache.org/jira/browse/SPARK-20113
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.1
>Reporter: Bhanu Akaveeti
>
> Dataframe.write in overwrite mode appends data to a MySQL table that does not 
> have a primary key
> df_mysql.write \
> .mode("overwrite") \
> .jdbc("jdbc:mysql://ip-address/database", "MySQL_Table", properties={"user": 
> "MySQL_user", "password": "MySQL_pw"})
> When the above script is run twice, the data is inserted twice. Also, I tried 
> with option("truncate","true"), but the data is still appended to the MySQL table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-03-27 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944080#comment-15944080
 ] 

Morten Hornbech commented on SPARK-14560:
-

No, not really. We were able to work around it by rewriting the job, but it was 
never clear what made the difference.

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}}s used by the old RDD API still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  E.g., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However, that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key

2017-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944079#comment-15944079
 ] 

Sean Owen commented on SPARK-20113:
---

If there is no primary key, how would anything know that the data is already 
inserted? There is no notion of sameness to decide the data is already there.

> overwrite mode appends data on MySQL table that does not have a primary key
> ---
>
> Key: SPARK-20113
> URL: https://issues.apache.org/jira/browse/SPARK-20113
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.1
>Reporter: Bhanu Akaveeti
>
> Dataframe.write in overwrite mode appends data to a MySQL table that does not 
> have a primary key
> df_mysql.write \
> .mode("overwrite") \
> .jdbc("jdbc:mysql://ip-address/database", "MySQL_Table", properties={"user": 
> "MySQL_user", "password": "MySQL_pw"})
> When the above script is run twice, the data is inserted twice. Also, I tried 
> with option("truncate","true"), but the data is still appended to the MySQL table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20113) overwrite mode appends data on MySQL table that does not have a primary key

2017-03-27 Thread Bhanu Akaveeti (JIRA)
Bhanu Akaveeti created SPARK-20113:
--

 Summary: overwrite mode appends data on MySQL table that does not 
have a primary key
 Key: SPARK-20113
 URL: https://issues.apache.org/jira/browse/SPARK-20113
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.1
Reporter: Bhanu Akaveeti


Dataframe.write in overwrite mode appends data to a MySQL table that does not 
have a primary key

df_mysql.write \
.mode("overwrite") \
.jdbc("jdbc:mysql://ip-address/database", "MySQL_Table", properties={"user": 
"MySQL_user", "password": "MySQL_pw"})

When the above script is run twice, the data is inserted twice. Also, I tried with 
option("truncate","true"), but the data is still appended to the MySQL table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Description: 
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen file are attached (with query plans). It's not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
repro this. But it does repro on m4.10xlarge with an io1 EBS drive. So I think 
that means it's not an issue with the code-gen, but I can't figure out what 
the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

  was:
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen files are attached (with query plans). It's not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro 
this. So I think that means it's not an issue with the code-gen, but I can't 
figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}


> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen files are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables.
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> 

[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Description: 
I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
hs_err_pid and codegen files are attached (with query plans). It's not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro 
this. So I think that means it's not an issue with the code-gen, but I can't 
figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}

  was:
I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The 
hs_err_pid and codegen files are attached (with query plans). It's not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro 
this. So I think that means it's not an issue with the code-gen, but I can't 
figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}


> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen files are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. So I think that means it's not an issue with the code-gen, but I 
> can't figure out what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables.
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> 

[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-03-27 Thread Darshan Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944077#comment-15944077
 ] 

Darshan Mehta commented on SPARK-14560:
---

[~lovemylover] [~mhornbech] Were you able to figure out or fix the issue? We are 
using Spark 2.1.0 and are facing a similar bug; the stack trace is below:

Caused by: java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, 
got 0
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:125)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:154)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:121)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:82)
at 
org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown
 Source)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:374)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:371)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
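
For context, the doubling growth pattern described in the issue below can be illustrated with a toy calculation (plain Scala, not Spark code): starting at 2 MB and doubling, the shuffle-read side grows its allocation to 16 MB even though it only needs about 12 MB in total and only 15 MB is free, and the non-cooperating {{Spillable}} consumers cannot hand any memory back.

{code}
object DoublingGrowth extends App {
  val availableMb = 15L   // memory still free after the final spill
  val neededMb    = 12L   // 2 MB already held plus the 10 MB still required

  // Allocation sizes as the reader doubles its request: 2, 4, 8, 16, ...
  val grownTo = Iterator.iterate(2L)(_ * 2).dropWhile(_ < neededMb).next()

  println(s"allocation grows to $grownTo MB although only $neededMb MB is needed " +
    s"and only $availableMb MB is available")
}
{code}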

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}}s used by the old RDD API still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  E.g., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all 

[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: (was: codegen_sorter_crash)

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The 
> hs_err_pid and codegen files are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. So I think that means it's not an issue with the code-gen, but I 
> can't figure out what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables.
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: codegen_sorter_crash.log

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The 
> hs_err_pid and codegen files are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. So I think that means it's not an issue with the code-gen, but I 
> can't figure out what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables.
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: hs_err_pid19271.log
codegen_sorter_crash

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The 
> hs_err_pid and codegen files are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. So I think that means it's not an issue with the code-gen, but I 
> can't figure out what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables.
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-27 Thread Mitesh (JIRA)
Mitesh created SPARK-20112:
--

 Summary: SIGSEGV in GeneratedIterator.sort_addToSorter
 Key: SPARK-20112
 URL: https://issues.apache.org/jira/browse/SPARK-20112
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
Reporter: Mitesh


I'm seeing a very weird crash in GeneratedIterator.sort_addToSorter. The 
hs_err_pid and codegen files are attached (with query plans). It's not a 
deterministic repro, but running a big query load, I eventually see it come up 
within a few minutes.

Here is some interesting repro information:
- Using r3.8xlarge machines, which have ephemeral attached drives, I can't repro 
this. So I think that means it's not an issue with the code-gen, but I can't 
figure out what the difference in behavior is.
- The broadcast joins in the plan are all small tables. I have 
autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
- As you can see from the plan, all the sources are cached memory tables.

{noformat}
# A fatal error has been detected by the Java Runtime Environment:
#
#  [thread 139872345896704 also had an error]
SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)

[thread 139872348002048 also had an error]# Problematic frame:
# 
J 28454 C1 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
 (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19876) Add OneTime trigger executor

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944073#comment-15944073
 ] 

Apache Spark commented on SPARK-19876:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/17444

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website

2017-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19904.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

There's not a bright line between what goes in the spark.apache.org site and 
what goes in the per-release spark.apache.org/docs docs. Anything release-specific 
must be in the latter, which suggests anything else goes in the former. I think 
SPIPs are process items and not release-specific, so I don't necessarily see 
value in linking to them from release-specific docs. I'd suggest we call this 
done.

> SPIP Add Spark Project Improvement Proposal doc to website
> --
>
> Key: SPARK-19904
> URL: https://issues.apache.org/jira/browse/SPARK-19904
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>  Labels: SPIP
> Fix For: 2.2.0
>
>
> see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18127) Add hooks and extension points to Spark

2017-03-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18127:

Component/s: (was: Spark Core)
 SQL

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Herman van Hovell
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to the 
> current SQL language. I should not have to implement a complete parser, and 
> should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19143) API in Spark for distributing new delegation tokens (Improve delegation token handling in secure clusters)

2017-03-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944046#comment-15944046
 ] 

Thomas Graves commented on SPARK-19143:
---

Yes, I can be the shepherd.

> API in Spark for distributing new delegation tokens (Improve delegation token 
> handling in secure clusters)
> --
>
> Key: SPARK-19143
> URL: https://issues.apache.org/jira/browse/SPARK-19143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ruslan Dautkhanov
>
> Spin off from SPARK-14743 and comments chain in [recent comments| 
> https://issues.apache.org/jira/browse/SPARK-5493?focusedCommentId=15802179=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15802179]
>  in SPARK-5493.
> Spark currently doesn't have a way of distributing new delegation tokens. 
> Quoting [~vanzin] from SPARK-5493 
> {quote}
> IIRC Livy doesn't yet support delegation token renewal. Once it reaches the 
> TTL, the session is unusable.
> There might be ways to hack support for that without changes in Spark, but 
> I'd like to see a proper API in Spark for distributing new delegation tokens. 
> I mentioned that in SPARK-14743, but although that bug is closed, that 
> particular feature hasn't been implemented yet.
> {quote}
> Other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944030#comment-15944030
 ] 

Seth Hendrickson commented on SPARK-19634:
--

I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces: an RDD one that provides the same performance as the current 
{{MultivariateOnlineSummarizer}}, and a DataFrame interface using a UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? If this is still targeted 
at 2.2, why? I'd prefer to get the details hashed out further rather than 
rushing to provide an API and an initial slow implementation; that way we can make 
sure that we get this correct in the long term. I would really appreciate some 
clarification, and my apologies if I have missed any of the details/discussion.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944019#comment-15944019
 ] 

Timothy Hunter commented on SPARK-19634:


[~dongjin] [~wm624] sorry it looks like I missed your comments... I pushed a PR 
for this feature. Please feel free to comment on the PR if you have the time.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944004#comment-15944004
 ] 

Timothy Hunter commented on SPARK-20111:


As Spark SQL is making more and more forays into code generation, I have been 
wondering if it would make sense to start adopting practical compiler 
technologies, such as generating first an intermediate representation, instead 
of doing string manipulation as we currently do. This is of course much beyond 
the scope of this particular ticket.

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20111:
--
Description: 
In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces 
a SQL codegen bug.
This is described in https://github.com/graphframes/graphframes/issues/165

Summary
* The unit test does a simple motif query on a graph.  Essentially, this means 
taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the 
(tiny) DataFrame.
* The test runs, but codegen fails.  See the linked GraphFrames issue for the 
stacktrace.

To reproduce this:
* Check out GraphFrames https://github.com/graphframes/graphframes
* Run {{sbt assembly}} to compile it and run tests

Copying [~felixcheung]'s comment from the GraphFrames issue 165:
{quote}
Seems like codegen bug; it looks like at least 2 issues:
1. At L472, inputadapter_value is not defined within scope
2. inputadapter_value is an InternalRow, for this statement to work
{{bhj_primitiveA = inputadapter_value;}}
it should be
{{bhj_primitiveA = inputadapter_value.getLong(0);}}
{quote}

  was:
In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces 
a SQL codegen bug.
This is described in https://github.com/graphframes/graphframes/issues/165

Summary
* The unit test does a simple motif query on a graph.  Essentially, this means 
taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the 
(tiny) DataFrame.
* The test runs, but codegen fails.  See the linked GraphFrames issue for the 
stacktrace.

To reproduce this:
* Check out GraphFrames https://github.com/graphframes/graphframes
* Run {{sbt assembly}} to compile it and run tests


> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20111:
-

 Summary: codegen bug surfaced by GraphFrames issue 165
 Key: SPARK-20111
 URL: https://issues.apache.org/jira/browse/SPARK-20111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.2.0
Reporter: Joseph K. Bradley


In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} surfaces 
a SQL codegen bug.
This is described in https://github.com/graphframes/graphframes/issues/165

Summary
* The unit test does a simple motif query on a graph.  Essentially, this means 
taking 2 DataFrames, doing a few joins, selecting 2 columns, and collecting the 
(tiny) DataFrame.
* The test runs, but codegen fails.  See the linked GraphFrames issue for the 
stacktrace.

To reproduce this:
* Check out GraphFrames https://github.com/graphframes/graphframes
* Run {{sbt assembly}} to compile it and run tests



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20110) Windowed aggregations do not work when the timestamp is a nested field

2017-03-27 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-20110:
-

 Summary: Windowed aggregations do not work when the timestamp is a 
nested field
 Key: SPARK-20110
 URL: https://issues.apache.org/jira/browse/SPARK-20110
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Alexis Seigneurin


I am loading data into a DataFrame with nested fields. I want to perform a 
windowed aggregation on a timestamp from a nested field:

{code}
  .groupBy(window($"auth.sysEntryTimestamp", "2 minutes"))
{code}

I get the following error:

{quote}
org.apache.spark.sql.AnalysisException: Multiple time window expressions would 
result in a cartesian product of rows, therefore they are not currently not 
supported.
{quote}

This works fine if I first extract the timestamp to a separate column:

{code}
  .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp")
  .groupBy(
window($"sysEntryTimestamp", "2 minutes")
  )
{code}

Please see the whole sample:
- batch: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html
- Structured Streaming: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html
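
A minimal Scala sketch of the workaround, assuming a DataFrame {{df}} with a nested struct column {{auth}} containing a timestamp field {{sysEntryTimestamp}}:

{code}
import org.apache.spark.sql.functions.{col, window}

// Pull the nested timestamp up to a top-level column before the windowed groupBy.
val counts = df
  .withColumn("sysEntryTimestamp", col("auth.sysEntryTimestamp"))
  .groupBy(window(col("sysEntryTimestamp"), "2 minutes"))
  .count()
{code}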



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Block

2017-03-27 Thread John Compitello (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943966#comment-15943966
 ] 

John Compitello commented on SPARK-20109:
-

I have a PR for this issue in the works; I'd like to be the one to handle it. 

> Need a way to convert from IndexedRowMatrix to Block
> 
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Block

2017-03-27 Thread John Compitello (JIRA)
John Compitello created SPARK-20109:
---

 Summary: Need a way to convert from IndexedRowMatrix to Block
 Key: SPARK-20109
 URL: https://issues.apache.org/jira/browse/SPARK-20109
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.0
Reporter: John Compitello


The current implementation of toBlockMatrix on IndexedRowMatrix is 
insufficient. It is implemented by first converting the IndexedRowMatrix to a 
CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
only is this slower than it needs to be, it also means that the created 
BlockMatrix ends up being backed by instances of SparseMatrix, which a user may 
not want. 
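
For illustration, a small Scala sketch (assuming a spark-shell session with {{sc}} available) of the conversion path described above; the direct call and the two-step route go through the same CoordinateMatrix conversion today:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))))
val irm = new IndexedRowMatrix(rows)

// What toBlockMatrix does internally today, per the description above.
val viaCoordinate = irm.toCoordinateMatrix().toBlockMatrix()
// The proposal is a direct IndexedRowMatrix -> BlockMatrix conversion that skips
// the CoordinateMatrix step and can keep the blocks dense.
val blocks = irm.toBlockMatrix()
{code}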



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943873#comment-15943873
 ] 

Seth Hendrickson commented on SPARK-20083:
--

Yes, that would be the intention. When we implement this change, we will have to 
take care to update any existing code that relies on {{toArray}} returning a new array.

> Change matrix toArray to not create a new array when matrix is already column 
> major
> ---
>
> Key: SPARK-20083
> URL: https://issues.apache.org/jira/browse/SPARK-20083
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> {{toArray}} always creates a new array in column major format, even when the 
> resulting array is the same as the backing values. We should change this to 
> just return a reference to the values array when it is already column major.
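
A rough sketch of the proposed behavior (hypothetical helper, not the actual MLlib code), written against {{org.apache.spark.ml.linalg.DenseMatrix}}, whose public {{values}} array is column major when {{isTransposed}} is false:

{code}
import org.apache.spark.ml.linalg.DenseMatrix

// Hypothetical helper illustrating the proposal: skip the copy when the matrix is
// already column major. Returning the backing array means callers could mutate the
// matrix through it, which is the trade-off raised in the discussion below.
def toArrayNoCopy(m: DenseMatrix): Array[Double] =
  if (!m.isTransposed) m.values  // already column major: return the backing values
  else m.toArray                 // transposed: fall back to the existing copying path
{code}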



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943857#comment-15943857
 ] 

yuhao yang commented on SPARK-20083:


So the returned array would allow users to manipulate the matrix values. Is that 
intentional?

> Change matrix toArray to not create a new array when matrix is already column 
> major
> ---
>
> Key: SPARK-20083
> URL: https://issues.apache.org/jira/browse/SPARK-20083
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> {{toArray}} always creates a new array in column major format, even when the 
> resulting array is the same as the backing values. We should change this to 
> just return a reference to the values array when it is already column major.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website

2017-03-27 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943827#comment-15943827
 ] 

Cody Koeninger commented on SPARK-19904:


It has been added to the apache/spark-website git repo and is published at 
http://spark.apache.org/improvement-proposals.html , with a link under the 
Community menu item.

It has not been added to the apache/spark git repo, i.e. it is not linked from 
http://spark.apache.org/docs/latest/ under the More menu item.

I'm not 100% clear on what the plan was for maintaining some of the website 
docs in a separate site from the main repo, and whether it's worth updating 
both for cross-linking.

> SPIP Add Spark Project Improvement Proposal doc to website
> --
>
> Key: SPARK-19904
> URL: https://issues.apache.org/jira/browse/SPARK-19904
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>  Labels: SPIP
>
> see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website

2017-03-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943799#comment-15943799
 ] 

Thomas Graves commented on SPARK-19904:
---

Is this done or what is this waiting on? 

> SPIP Add Spark Project Improvement Proposal doc to website
> --
>
> Key: SPARK-19904
> URL: https://issues.apache.org/jira/browse/SPARK-19904
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>  Labels: SPIP
>
> see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20087:


Assignee: Apache Spark

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>Assignee: Apache Spark
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422
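
For illustration, a hypothetical listener (not part of Spark) showing the kind of aggregation this affects; because {{taskEnd.taskMetrics}} is not populated for intentionally killed tasks today, totals like this under-count the work done by speculative duplicates:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class AppCostListener extends SparkListener {
  @volatile var totalExecutorRunTimeMs = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null when a task ends without metrics, e.g. killed tasks today.
    Option(taskEnd.taskMetrics).foreach { m =>
      totalExecutorRunTimeMs += m.executorRunTime
    }
  }
}

// Registered from the driver with sc.addSparkListener(new AppCostListener).
{code}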



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943755#comment-15943755
 ] 

Apache Spark commented on SPARK-20087:
--

User 'noodle-fb' has created a pull request for this issue:
https://github.com/apache/spark/pull/17422

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20087:


Assignee: (was: Apache Spark)

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20105) Add tests for checkType and type string in structField in R

2017-03-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20105.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Add tests for checkType and type string in structField in R
> ---
>
> Key: SPARK-20105
> URL: https://issues.apache.org/jira/browse/SPARK-20105
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> It seems {{checkType}} and the type string in {{structField}} are not being 
> tested closely.
> This string format currently seems R-specific. Therefore, it seems nicer if 
> we test positive/negative cases in R side.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds

2017-03-27 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-20102.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Fixed for 2.1.1 and master.

> Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
> 
>
> Key: SPARK-20102
> URL: https://issues.apache.org/jira/browse/SPARK-20102
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.1, 2.2.0
>
>
> The master snapshot publisher builds are currently broken due to two minor 
> build issues:
> 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors 
> when the remote FTP directory already exists. To work around this, we should 
> update the script to ignore errors.
> 2. The PySpark setup.py file references a non-existent module, causing Python 
> packaging to fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-27 Thread Daniel Nuriyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943654#comment-15943654
 ] 

Daniel Nuriyev commented on SPARK-20037:


Thank you for your feedback. This problem started when I upgraded the Kafka client 
jars, but since you can't reproduce it, I'll dig in myself.

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, offsets.png
>
>
> I use kafka 0.10.1 and Java code with the following dependencies:
> * org.apache.kafka : kafka_2.11 : 0.10.1.1
> * org.apache.kafka : kafka-clients : 0.10.1.1
> * org.apache.spark : spark-streaming_2.11 : 2.0.0
> * org.apache.spark : spark-streaming-kafka-0-10_2.11 : 2.0.0
> The code tries to read a topic starting from given offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because this topic/partition starts at 584464; it does 
> not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use a direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943652#comment-15943652
 ] 

Sean Owen commented on SPARK-20037:
---

I cannot reproduce this in my application. There is more to your code that you 
don't show, and you have two different reports here.
You may have a real problem; I'm just saying this is not sufficient as a JIRA 
in this project, and I would not expect someone to investigate it.

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from given offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because this topic/partition starts at 584464; it does 
> not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use a direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-27 Thread Daniel Nuriyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nuriyev updated SPARK-20037:
---
Attachment: Main.java

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from given offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because this topic/partition starts at 584464; it does 
> not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use a direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-27 Thread Daniel Nuriyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943642#comment-15943642
 ] 

Daniel Nuriyev commented on SPARK-20037:


My system is absolutely simple: a topic whose offset starts at X, and a single 
Java method that opens a streaming context and reads the topic starting from an 
existing offset. The only dependencies are listed above.
I do not think that this is a Spark problem. I think it is a problem in one of 
the Kafka jars.
I will attach the Java method.
Have you tried reproducing it? For me it's consistent.
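
As an aside (not code from this ticket), here is a minimal Scala sketch for 
sanity-checking the requested offsets against what the broker actually holds: it 
queries each partition's beginning and end offsets with the plain Kafka 0.10.1 
consumer before building fromOffsets. The broker address and group id are 
placeholders.

{code:scala}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("group.id", "offset-check")             // placeholder group id
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
// The 4 partitions of the topic mentioned in the error message.
val partitions = (0 until 4).map(new TopicPartition("commerce_item_expectation", _)).asJava
val beginnings = consumer.beginningOffsets(partitions).asScala
val ends       = consumer.endOffsets(partitions).asScala
beginnings.foreach { case (tp, begin) =>
  println(s"$tp: beginning=$begin end=${ends(tp)}")
}
consumer.close()
{code}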

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from given offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because this topic/partition starts at 584464; it does 
> not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use a direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-20088:
-

Assignee: Hossein Falaki

> Do not create new SparkContext in SparkR createSparkContext
> ---
>
> Key: SPARK-20088
> URL: https://issues.apache.org/jira/browse/SPARK-20088
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.2.0
>
>
> In the implementation of {{createSparkContext}}, we are calling 
> {code}
>  new JavaSparkContext()
> {code}
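
For context, a hedged sketch of the usual way to avoid spinning up a second 
context (not necessarily the exact change that was merged): reuse any existing 
SparkContext via getOrCreate and wrap it in a JavaSparkContext instead of 
constructing a new one. The app name is a placeholder.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext

// Reuse the already-running SparkContext if there is one; only create a new
// context when none exists yet.
val conf = new SparkConf().setAppName("sparkr-backend")  // placeholder app name
val sc   = SparkContext.getOrCreate(conf)
val jsc  = new JavaSparkContext(sc)                      // wrap, don't recreate
{code}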



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-20088.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17423
[https://github.com/apache/spark/pull/17423]

> Do not create new SparkContext in SparkR createSparkContext
> ---
>
> Key: SPARK-20088
> URL: https://issues.apache.org/jira/browse/SPARK-20088
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
> Fix For: 2.2.0
>
>
> In the implementation of {{createSparkContext}}, we are calling 
> {code}
>  new JavaSparkContext()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20104.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17438
[https://github.com/apache/spark/pull/17438]

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
> Fix For: 2.2.0
>
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately. 
> For example, left outer join estimation does not accurately update `nullCount` 
> currently. So for IsNull and IsNotNull predicates, we only estimate them when 
> the child is a leaf node, whose `nullCount` is accurate.
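
As a purely illustrative toy model (these case classes are not Spark's internal 
plan or statistics API), the rule above amounts to applying null-count 
selectivity only when the child is a leaf node:

{code:scala}
// Toy sketch: only trust nullCount when the child is a leaf node.
object NullEstimateSketch extends App {
  sealed trait Plan { def rowCount: Long; def nullCount: Long }
  case class Leaf(rowCount: Long, nullCount: Long) extends Plan
  case class Join(rowCount: Long, nullCount: Long) extends Plan  // nullCount may be stale

  def isNullSelectivity(child: Plan): Option[Double] = child match {
    case Leaf(rows, nulls) if rows > 0 => Some(nulls.toDouble / rows)  // accurate stats
    case _                             => None                         // skip the estimate
  }

  println(isNullSelectivity(Leaf(rowCount = 100, nullCount = 20)))  // Some(0.2)
  println(isNullSelectivity(Join(rowCount = 100, nullCount = 20)))  // None
}
{code}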



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20104:
---

Assignee: Zhenhua Wang

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately. 
> For example, left outer join estimation does not accurately update `nullCount` 
> currently. So for IsNull and IsNotNull predicates, we only estimate them when 
> the child is a leaf node, whose `nullCount` is accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-27 Thread Shubham Chopra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943460#comment-15943460
 ] 

Shubham Chopra commented on SPARK-19803:


The PR enforces a refresh of the peer list cached on the executor that is 
trying to proactively replicate the block. This fix ensures that an executor 
will never try to replicate to a previously failed peer because of a stale 
reference. In addition, in the unit test, the block managers are now explicitly 
stopped when they are removed from the master.

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943398#comment-15943398
 ] 

Apache Spark commented on SPARK-20107:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17442

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save about {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
> versions (see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 and higher versions.
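
For reference, a minimal sketch (assuming a Hadoop version that supports the v2 
algorithm) of setting this from the Spark side; spark.hadoop.* keys are forwarded 
into the Hadoop configuration, and the same key can also be passed with --conf on 
spark-submit. The app name is a placeholder.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: enable the v2 file output committer algorithm for this session.
val spark = SparkSession.builder()
  .appName("commit-v2-sketch")  // placeholder app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .enableHiveSupport()          // needed for the CREATE TABLE ... AS SELECT above
  .getOrCreate()
{code}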



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20107:


Assignee: (was: Apache Spark)

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save about {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
> versions (see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 and higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20107:


Assignee: Apache Spark

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save about {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
> versions (see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 and higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-27 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20107:

Summary: Speed up HadoopMapReduceCommitProtocol#commitJob for many output 
files  (was: Speed up FileOutputCommitter#commitJob for many output files)

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save about {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
> versions (see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 and higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20107) Speed up FileOutputCommitter#commitJob for many output files

2017-03-27 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20107:

Description: 
Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
[HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
 for many output files.

It can save about {{11 minutes}} for 216869 output files:
{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
substr(ds, 3, 2),
substr(ds, 6, 2),
substr(ds, 9, 2)
  ) shortDate,
  CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
'invalid actio' END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
AND actiontype IN ('0','1','2','3');
{code}
{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}


This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
versions (see: 
[cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
 and 
[cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
 and Apache Hadoop 2.7.0 and higher versions.

  was:
It can save about {{11 minutes}} for 216869 output files.

This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
versions (see: 
https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433
 and 
https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0)
 and Apache Hadoop 2.7.0 and higher versions.


> Speed up FileOutputCommitter#commitJob for many output files
> 
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save about {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher 
> versions (see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 and higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join

2017-03-27 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943382#comment-15943382
 ] 

Michel Lemay commented on SPARK-16938:
--

I just stumbled upon a similar issue with 'union' on two dataframes. Is anybody 
still working on this issue, and if not, could it be revived?



> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> Found a change of behavior in spark-2.0.0, which breaks a query in our code 
> base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the following exception: 
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
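
A hedged workaround sketch (not from this ticket, and assuming the dfa/dfb 
definitions quoted above): rename the ambiguous columns before the join so that 
dropDuplicates receives plain, unambiguous names.

{code:scala}
// Workaround sketch: give each id column a distinct name before joining.
val dfa2 = dfa.withColumnRenamed("id", "dfa_id")
val dfb2 = dfb.withColumnRenamed("id", "dfb_id")

dfa2.join(dfb2, dfa2("dfa_id") === dfb2("dfb_id"))
    .dropDuplicates(Array("dfa_id", "dfb_id"))
{code}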



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20108) Spark query is getting failed with exception

2017-03-27 Thread ZS EDGE (JIRA)
ZS EDGE created SPARK-20108:
---

 Summary: Spark query is getting failed with exception
 Key: SPARK-20108
 URL: https://issues.apache.org/jira/browse/SPARK-20108
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: ZS EDGE




In our project we have implemented logic that programmatically generates Spark 
queries. These queries are executed as subqueries; below is a sample query:

sqlContext.sql("INSERT INTO TABLE 
test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT 
'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22
 
20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','',''
 FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN (select 
row_id from s3_empdtl_incremental_r where row_id IN(42949672960))")

While executing the above code in pyspark, it throws the exception below:

org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
at 
org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more

[Stage 32:=>  (10 + 5) / 26]
17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 failed 4 times; 
aborting job
17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in stage 32.0 
(TID 857, ip-10-116-1-73.ec2.internal): org.apache.spark.SparkException: Task 
failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused 
