[jira] [Updated] (SPARK-20876) If the input parameter is of float type for ceil or floor, the result is not what we expect

2017-05-26 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20876:

Affects Version/s: 2.2.0

> If the input parameter is of float type for ceil or floor, the result is not 
> what we expect
> --
>
> Key: SPARK-20876
> URL: https://issues.apache.org/jira/browse/SPARK-20876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>
> spark-sql>SELECT ceil(cast(12345.1233 as float));
> spark-sql>12345
> For this case, the expected result is 12346
> spark-sql>SELECT floor(cast(-12345.1233 as float));
> spark-sql>-12345
> For this case, the expected result is -12346
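For convenience, here is the same pair of statements wrapped in a minimal Scala 
driver (added for illustration; the session setup is an assumption, the SQL is 
taken verbatim from the report):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
// Expected results per the report: 12346 and -12346.
spark.sql("SELECT ceil(cast(12345.1233 as float))").show()
spark.sql("SELECT floor(cast(-12345.1233 as float))").show()
{code}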



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027214#comment-16027214
 ] 

Apache Spark commented on SPARK-6628:
-

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18127

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
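A reproduction sketch (not taken from the report; the table name, column mapping 
and setup are assumptions): the HBase-backed table is presumed to already exist 
in the Hive metastore, e.g. created from the Hive CLI with the HBase storage 
handler, and the "insert into" is then issued through Spark SQL.
{code}
import org.apache.spark.sql.SparkSession

// Assumes a Hive metastore that already contains an HBase-backed table, e.g.
//   CREATE TABLE hbase_demo (key INT, value STRING)
//   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
//   WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value');
val spark = SparkSession.builder()
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()

// The kind of statement that hits the ClassCastException in the stack trace above.
spark.sql("INSERT INTO TABLE hbase_demo SELECT 1, 'a'")
{code}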



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19372:
-
Fix Version/s: 2.2.0

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0, 2.3.0
>
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB".
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}
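One possible user-side workaround, sketched here as an illustration (this is not 
the fix that was merged): evaluate the 400-way OR inside a single UDF over a 
struct of the columns, so Catalyst does not generate one oversized predicate 
method. The null handling (a null column counts as "differs") and the column 
count are assumptions based on the snippet above.
{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{struct, udf}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val dataframe = spark.read.format("com.databricks.spark.csv").load("wide400cols.csv")

// Keep a row if any of the first 400 columns differs from "column<i+1>",
// the same condition the folded OR chain expresses (modulo null handling).
val anyDiffers = udf { row: Row =>
  (0 until 400).exists(i => row.getString(i) != s"column${i + 1}")
}

val cols = dataframe.columns.take(400).map(dataframe.col)
val filtered = dataframe.filter(anyDiffers(struct(cols: _*)))
filtered.show(100)
{code}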



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20843.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0
   2.1.2

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>Assignee: Shixiong Zhu
>  Labels: regression
> Fix For: 2.1.2, 2.2.0
>
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.
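For illustration, a minimal sketch of the termination pattern the commit 
introduced (plain java.lang.Process API; the 10-second grace period is the value 
cited above, and making it configurable is what this issue asks for):
{code}
import java.util.concurrent.TimeUnit

// Graceful-then-forcible shutdown: ask the process to exit, wait for a grace
// period, and only force-kill it if it is still alive afterwards.
def terminate(process: Process, gracePeriodSecs: Long = 10L): Int = {
  process.destroy()                                   // polite request (SIGTERM on Unix)
  if (!process.waitFor(gracePeriodSecs, TimeUnit.SECONDS)) {
    process.destroyForcibly()                         // Java 8+: force-kill if still running
  }
  process.waitFor()                                   // block until exit, return the exit code
}
{code}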



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20748) Built-in SQL Function Support - CH[A]R

2017-05-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20748.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.3.0

> Built-in SQL Function Support - CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>  Labels: starter
> Fix For: 2.3.0
>
>
> {noformat}
> CH[A]R()
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
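A usage sketch, assuming the function follows the Oracle semantics referenced 
above (the exact Spark syntax and output are not verified here):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
// chr(65) is expected to return the character with ASCII code 65, i.e. 'A'.
spark.sql("SELECT chr(65) AS c").show()
{code}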



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20905) When running Spark with yarn-client, a large executor-cores setting leads to bad performance.

2017-05-26 Thread Cherry Zhang (JIRA)
Cherry Zhang created SPARK-20905:


 Summary: When running Spark with yarn-client, a large 
executor-cores setting leads to bad performance. 
 Key: SPARK-20905
 URL: https://issues.apache.org/jira/browse/SPARK-20905
 Project: Spark
  Issue Type: Question
  Components: Examples
Affects Versions: 2.0.0
Reporter: Cherry Zhang


Hi, all:
 When I run a training job in Spark with yarn-client and set 
executor-cores=20 (less than vcores=24) and executor-num=4 (my cluster has 4 
slaves), there is always one node whose computing time is larger than the 
others.

I checked some blogs, and they say executor-cores should be set to less than 5 
when there are many concurrent threads. I tried executor-cores=4 and 
executor-num=20, and that worked.

But I don't know why. Can you explain? Thank you very much.
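A configuration sketch of the setup that worked for the reporter (the property 
names are the standard Spark ones; the values come from the description and are 
not a general tuning recommendation):
{code}
import org.apache.spark.sql.SparkSession

// Many small executors (executor-num=20, executor-cores=4) rather than a few
// very wide ones (executor-num=4, executor-cores=20).
val spark = SparkSession.builder()
  .master("yarn")
  .config("spark.submit.deployMode", "client")
  .config("spark.executor.instances", "20")
  .config("spark.executor.cores", "4")
  .getOrCreate()
{code}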



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20904) Task failures during shutdown cause problems with preempted executors

2017-05-26 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-20904:
--

 Summary: Task failures during shutdown cause problems with 
preempted executors
 Key: SPARK-20904
 URL: https://issues.apache.org/jira/browse/SPARK-20904
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


Spark runs tasks in a thread pool that uses daemon threads in each executor. 
That means that when the JVM gets a signal to shut down, those tasks keep 
running.

Now when YARN preempts an executor, it sends a SIGTERM to the process, 
triggering the JVM shutdown. That causes shutdown hooks to run which may cause 
user code running in those tasks to fail, and report task failures to the 
driver. Those failures are then counted towards the maximum number of allowed 
failures, even though in this case we don't want that because the executor was 
preempted.

So we need a better way to handle that situation.
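A toy sketch of the race described above (plain JVM, no Spark; the names and 
timings are made up): a daemon worker keeps running while a shutdown hook tears 
down something it depends on, so the worker fails during shutdown even though 
its work was fine.
{code}
object ShutdownRace {
  @volatile private var resourceOpen = true

  def main(args: Array[String]): Unit = {
    // Analogous to Spark's shutdown hooks closing executor-side state on SIGTERM.
    sys.addShutdownHook { resourceOpen = false }

    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          if (!resourceOpen) throw new IllegalStateException("resource closed during shutdown")
          Thread.sleep(100)
        }
      }
    })
    worker.setDaemon(true)   // like Spark's task threads, it does not block JVM exit
    worker.start()

    Thread.sleep(60000)      // send SIGTERM (kill <pid>) while this sleeps to observe the failure
  }
}
{code}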



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20462) Spark-Kinesis Direct Connector

2017-05-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20462:
-
Component/s: (was: Input/Output)
 DStreams

> Spark-Kinesis Direct Connector 
> ---
>
> Key: SPARK-20462
> URL: https://issues.apache.org/jira/browse/SPARK-20462
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Lauren Moos
>
> I'd like to propose and the vet the design for a direct connector between 
> Spark and Kinesis. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-05-26 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027036#comment-16027036
 ] 

Sital Kedia commented on SPARK-20178:
-

>> So to get the robustness for now I'm fine with just invalidating it 
>> immediately and see how that works.

[~tgraves] - Let me know if you want me to resurrect 
https://github.com/apache/spark/pull/17088, which does exactly that. It was 
inadvertently closed as a stale PR. 

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures. There are 4 JIRAs currently related to this: SPARK-20163, 
> SPARK-20091, SPARK-14649, and SPARK-19753.
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> I will put my initial thoughts in a follow-on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10643) Support remote application download in client mode spark submit

2017-05-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-10643.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Assignee: Yu Peng
>Priority: Minor
> Fix For: 2.2.0
>
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10643) Support remote application download in client mode spark submit

2017-05-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-10643:
---

Assignee: Yu Peng

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Assignee: Yu Peng
>Priority: Minor
> Fix For: 2.2.0
>
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20873) Improve the error message for unsupported Column Type

2017-05-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20873.
-
   Resolution: Fixed
 Assignee: Ruben Janssen
Fix Version/s: 2.3.0

> Improve the error message for unsupported Column Type
> -
>
> Key: SPARK-20873
> URL: https://issues.apache.org/jira/browse/SPARK-20873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruben Janssen
>Assignee: Ruben Janssen
> Fix For: 2.3.0
>
>
> For an unsupported column type, we currently output the column type object 
> instead of its type name. 
> {noformat}
> java.lang.Exception: Unsupported type: 
> org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d
> {noformat}
> We should improve it by outputting its name.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide

2017-05-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20694.
-
   Resolution: Fixed
 Assignee: Maciej Szymkiewicz
Fix Version/s: 2.2.0

> Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
> --
>
> Key: SPARK-20694
> URL: https://issues.apache.org/jira/browse/SPARK-20694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> - Spark SQL, DataFrames and Datasets Guide should contain a section about 
> partitioned, sorted and bucketed writes.
> - Bucketing should be removed from Unsupported Hive Functionality.
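A usage sketch of the three writer options the new section covers (the column 
and table names are made up; the bucketed, sorted write goes through 
saveAsTable):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val events = spark.range(0, 1000).selectExpr("id", "id % 7 AS day", "id % 13 AS user")

events.write
  .partitionBy("day")              // one output directory per distinct `day`
  .bucketBy(8, "user")             // hash `user` into 8 buckets within each partition
  .sortBy("user")                  // sort rows inside each bucket
  .saveAsTable("events_bucketed")  // bucketed writes are persisted as a table
{code}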



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20843:


Assignee: Apache Spark

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>Assignee: Apache Spark
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20843:


Assignee: (was: Apache Spark)

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026870#comment-16026870
 ] 

Apache Spark commented on SPARK-20843:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18126

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026846#comment-16026846
 ] 

Shixiong Zhu commented on SPARK-20894:
--

Thanks for reporting it. Could you provide the driver log and the logs of all 
executors? HDFSBackedStateStoreProvider logs a message when it writes a file, 
and I want to know whether it actually wrote 
"/usr/local/hadoop/checkpoint/state/0/0/1.delta". The write may have happened 
on another executor, which is why I want to see all the logs.

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> For some reason this fails with the error: 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I cleared all the checkpoint data in /usr/local/hadoop/checkpoint/ and all 
> consumer offsets in Kafka from all brokers prior to running, yet this error 
> still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20894:
-
Docs Text:   (was: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
17/05/25 23:01:05 INFO CoarseGrainedExecutorBackend: Started daemon with 
process name: 1453@ip-172-31-25-189
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for TERM
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for HUP
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for INT
17/05/25 23:01:06 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/05/25 23:01:06 INFO SecurityManager: Changing view acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing view acls groups to: 
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls groups to: 
17/05/25 23:01:06 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups 
with view permissions: Set(); users  with modify permissions: Set(ubuntu); 
groups with modify permissions: Set()
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection 
to /192.31.29.39:52000 after 55 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO SecurityManager: Changing view acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing view acls groups to: 
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls groups to: 
17/05/25 23:01:06 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups 
with view permissions: Set(); users  with modify permissions: Set(ubuntu); 
groups with modify permissions: Set()
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection 
to /192.31.29.39:52000 after 1 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO DiskBlockManager: Created local directory at 
/usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/blockmgr-bc0640eb-2d3d-4933-b83c-1b0222740de5
17/05/25 23:01:06 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/05/25 23:01:06 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
spark://CoarseGrainedScheduler@192.31.29.39:52000
17/05/25 23:01:06 INFO WorkerWatcher: Connecting to worker 
spark://Worker@192.31.25.189:58000
17/05/25 23:01:06 INFO WorkerWatcher: Successfully connected to 
spark://Worker@192.31.25.189:58000
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection 
to /192.31.25.189:58000 after 4 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO CoarseGrainedExecutorBackend: Successfully registered 
with driver
17/05/25 23:01:06 INFO Executor: Starting executor ID 0 on host 192.31.25.189
17/05/25 23:01:06 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 52100.
17/05/25 23:01:06 INFO NettyBlockTransferService: Server created on 
192.31.25.189:52100
17/05/25 23:01:06 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
17/05/25 23:01:06 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:06 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:06 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:10 INFO CoarseGrainedExecutorBackend: Got assigned task 0
17/05/25 23:01:10 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/05/25 23:01:10 INFO Executor: Fetching 
spark://192.31.29.39:52000/jars/dataprocessing-client-stream.jar with timestamp 
1495753264902
17/05/25 23:01:10 INFO TransportClientFactory: Successfully created connection 
to /192.31.29.39:52000 after 2 ms (0 ms spent in bootstraps)
17/05/25 23:01:10 INFO Utils: Fetching 
spark://192.31.29.39:52000/jars/dataprocessing-client-stream.jar to 
/usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/spark-4d4ed6ae-202c-4ef0-88ea-ab9f1bdf424e/fetchFileTemp2546265996353018358.tmp
17/05/25 23:01:10 INFO Utils: Copying 
/usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/spark-4d4ed6ae-202c-4ef0-88ea-ab9f1bdf424e/2978712601495753264902_cache
 to 
/usr/local/spark/work/app-20170525230105-0019/0/./dataprocessing-client-stream.jar
17/05/25 23:01:10 INFO Executor: Adding 
file:/usr/local/spark/work/app-20170525230105-0019/0/./dataprocessing-client-stream.jar
 to class loader
17/05/25 23:01:10 INFO TorrentBroadcast: 

[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026848#comment-16026848
 ] 

Shixiong Zhu commented on SPARK-20894:
--

By the way, you can click "More" -> "Attach Files" to upload logs instead.

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> For some reason this fails with the error: 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I cleared all the checkpoint data in /usr/local/hadoop/checkpoint/ and all 
> consumer offsets in Kafka from all brokers prior to running, yet this error 
> still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026841#comment-16026841
 ] 

Michael Allman commented on SPARK-20843:


bq. Will a per-cluster config be enough for your usage?

Yes.

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20891) Reduce duplicate code in typedaggregators.scala

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20891:


Assignee: (was: Apache Spark)

> Reduce duplicate code in typedaggregators.scala
> ---
>
> Key: SPARK-20891
> URL: https://issues.apache.org/jira/browse/SPARK-20891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ruben Janssen
>
> With SPARK-20411, a significant amount of functions will be added to 
> typedaggregators.scala, resulting in a large amount of duplicate code



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20891) Reduce duplicate code in typedaggregators.scala

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20891:


Assignee: Apache Spark

> Reduce duplicate code in typedaggregators.scala
> ---
>
> Key: SPARK-20891
> URL: https://issues.apache.org/jira/browse/SPARK-20891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ruben Janssen
>Assignee: Apache Spark
>
> With SPARK-20411, a significant amount of functions will be added to 
> typedaggregators.scala, resulting in a large amount of duplicate code



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20891) Reduce duplicate code in typedaggregators.scala

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026831#comment-16026831
 ] 

Apache Spark commented on SPARK-20891:
--

User 'setjet' has created a pull request for this issue:
https://github.com/apache/spark/pull/18125

> Reduce duplicate code in typedaggregators.scala
> ---
>
> Key: SPARK-20891
> URL: https://issues.apache.org/jira/browse/SPARK-20891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ruben Janssen
>
> With SPARK-20411, a significant amount of functions will be added to 
> typedaggregators.scala, resulting in a large amount of duplicate code



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026809#comment-16026809
 ] 

Shixiong Zhu commented on SPARK-20843:
--

[~michael] Will a per-cluster config be enough for your usage? A per-app config 
requires more changes and it's too risky for 2.2 now.

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VM's 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shutdown gracefully 
> from shutting down gracefully. For example, streaming apps with a large batch 
> duration (say, 30 seconds+) can take minutes to shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-05-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026807#comment-16026807
 ] 

Sean Owen commented on SPARK-19547:
---

Questions are for the mailing list, not JIRA. 

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my Scala code to create a Spark Kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after running for 10 hours, it throws exceptions:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 148669538 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
> at 
> 

[jira] [Resolved] (SPARK-20014) Optimize mergeSpillsWithFileStream method

2017-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20014.
--
   Resolution: Fixed
 Assignee: Sital Kedia
Fix Version/s: 2.3.0

> Optimize mergeSpillsWithFileStream method
> -
>
> Key: SPARK-20014
> URL: https://issues.apache.org/jira/browse/SPARK-20014
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.3.0
>
>
> When the individual partition size in a spill is small, the 
> mergeSpillsWithTransferTo method does many small disk I/Os, which is really 
> inefficient. One way to improve performance would be to use the 
> mergeSpillsWithFileStream method instead, by turning off transferTo and using 
> buffered file reads/writes to improve the I/O throughput. 
> However, the current implementation of mergeSpillsWithFileStream does not do 
> buffered reads/writes of the files, and in addition it unnecessarily flushes 
> the output file for each partition.  
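An illustrative sketch of the buffered-stream idea (not Spark's actual 
mergeSpillsWithFileStream code; the signature and buffer size are assumptions): 
wrap the raw file streams so the merge issues large sequential reads and writes, 
and flush only once at the end instead of once per partition.
{code}
import java.io.{BufferedInputStream, BufferedOutputStream, File, FileInputStream, FileOutputStream}

def mergeSpillFiles(spills: Seq[File], out: File, bufferSize: Int = 1 << 20): Unit = {
  val output = new BufferedOutputStream(new FileOutputStream(out), bufferSize)
  try {
    val buf = new Array[Byte](bufferSize)
    for (spill <- spills) {
      val input = new BufferedInputStream(new FileInputStream(spill), bufferSize)
      try {
        var n = input.read(buf)
        while (n != -1) {
          output.write(buf, 0, n)
          n = input.read(buf)
        }
      } finally {
        input.close()
      }
    }
  } finally {
    output.close()   // single flush/close at the end, not one flush per partition
  }
}
{code}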



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20844) Remove experimental from API and docs

2017-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20844.
--
   Resolution: Fixed
 Assignee: Michael Armbrust
Fix Version/s: 2.2.0

> Remove experimental from API and docs
> -
>
> Key: SPARK-20844
> URL: https://issues.apache.org/jira/browse/SPARK-20844
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.2.0
>
>
> As of Spark 2.2, we know of several large-scale production use cases of 
> Structured Streaming. We should update the wording in the API and docs 
> accordingly before the release.
> Let's leave {{Evolving}} on the internal APIs that we still plan to change, 
> though (source and sink).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-05-26 Thread Pankaj Rastogi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026795#comment-16026795
 ] 

Pankaj Rastogi commented on SPARK-19547:


This ticket is marked as Invalid, but there is no explanation or reasoning. Can 
someone provide information on why this issue is invalid?

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my Scala code to create a Spark Kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after running for 10 hours, it throws exceptions:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 148669538 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at 
> 

[jira] [Commented] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026760#comment-16026760
 ] 

Felix Cheung commented on SPARK-20877:
--

OK, I will track down the test run time on Windows. 

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026694#comment-16026694
 ] 

Apache Spark commented on SPARK-20900:
--

User 'nonsleepr' has created a pull request for this issue:
https://github.com/apache/spark/pull/18124

> ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
> --
>
> Key: SPARK-20900
> URL: https://issues.apache.org/jira/browse/SPARK-20900
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0, 1.6.0, 2.1.0
> Environment: Spark 2.1.0
>Reporter: Alexander Bessonov
>Priority: Minor
>
> When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is 
> not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw 
> {{IllegalArgumentException}} instead of returning {{null}}. This is not 
> handled and the exception crashes the job.
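A defensive-handling sketch (not the actual patch in the PR above; the helper 
name is made up): treat a missing or empty SPARK_YARN_STAGING_DIR as "no staging 
directory" instead of letting {{new Path("")}} throw.
{code}
import org.apache.hadoop.fs.Path

// Returns None when the variable is unset or empty, so callers can skip cleanup
// instead of crashing on the IllegalArgumentException thrown by `new Path("")`.
def stagingDirPath(env: Map[String, String] = sys.env): Option[Path] =
  env.get("SPARK_YARN_STAGING_DIR").filter(_.nonEmpty).map(new Path(_))
{code}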



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20900:


Assignee: (was: Apache Spark)

> ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
> --
>
> Key: SPARK-20900
> URL: https://issues.apache.org/jira/browse/SPARK-20900
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0, 1.6.0, 2.1.0
> Environment: Spark 2.1.0
>Reporter: Alexander Bessonov
>Priority: Minor
>
> When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is 
> not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw 
> {{IllegalArgumentException}} instead of returning {{null}}. This is not 
> handled and the exception crashes the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20900:


Assignee: Apache Spark

> ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
> --
>
> Key: SPARK-20900
> URL: https://issues.apache.org/jira/browse/SPARK-20900
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0, 1.6.0, 2.1.0
> Environment: Spark 2.1.0
>Reporter: Alexander Bessonov
>Assignee: Apache Spark
>Priority: Minor
>
> When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is 
> not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw 
> {{IllegalArgumentException}} instead of returning {{null}}. This is not 
> handled and the exception crashes the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20902) Word2Vec implementations with Negative Sampling

2017-05-26 Thread Shubham Chopra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chopra updated SPARK-20902:
---
Description: 
Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical softmax. 
Both Continuous bag of words (CBOW) and SkipGram have shown comparative or 
better performance with Negative Sampling. This umbrella JIRA is to keep a 
track of the effort to add negative sampling based implementations of both CBOW 
and SkipGram models to Spark MLlib.

Since word2vec is largely a pre-processing step, the performance often can 
depend on the application it is being used for, and the corpus it is estimated 
on. These implementation give users the choice of picking one that works best 
for their use-case.

  was:Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical 
softmax. Both Continuous bag of words (CBOW) and SkipGram have shown 
comparative or better performance with Negative Sampling. This umbrella JIRA is 
to keep a track of the effort to add negative sampling based implementations of 
both CBOW and SkipGram models to Spark MLlib.


> Word2Vec implementations with Negative Sampling
> ---
>
> Key: SPARK-20902
> URL: https://issues.apache.org/jira/browse/SPARK-20902
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Shubham Chopra
>  Labels: ML
>
> Spark MLlib Word2Vec currently only implements Skip-Gram + hierarchical 
> softmax. Both continuous bag of words (CBOW) and Skip-Gram have shown 
> comparable or better performance with negative sampling. This umbrella JIRA 
> is to keep track of the effort to add negative-sampling-based 
> implementations of both the CBOW and Skip-Gram models to Spark MLlib.
> Since word2vec is largely a pre-processing step, performance often depends 
> on the application it is used for and the corpus it is estimated on. These 
> implementations give users the choice of picking the one that works best 
> for their use case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026669#comment-16026669
 ] 

Apache Spark commented on SPARK-20903:
--

User 'shubhamchopra' has created a pull request for this issue:
https://github.com/apache/spark/pull/18123

> Word2Vec Skip-Gram + Negative Sampling
> --
>
> Key: SPARK-20903
> URL: https://issues.apache.org/jira/browse/SPARK-20903
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Shubham Chopra
>
> Skip-Gram + Negative Sampling has been shown to be comparable to or to 
> outperform the hierarchical-softmax-based approach currently implemented in 
> Spark. Since word2vec is largely a pre-processing step, the performance often 
> depends on the application it is being used for and the corpus it is 
> estimated on. These implementations give users the choice of picking the one 
> that works best for their use case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20903:


Assignee: (was: Apache Spark)

> Word2Vec Skip-Gram + Negative Sampling
> --
>
> Key: SPARK-20903
> URL: https://issues.apache.org/jira/browse/SPARK-20903
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Shubham Chopra
>
> Skip-Gram + Negative Sampling has been shown to be comparable to or to 
> outperform the hierarchical-softmax-based approach currently implemented in 
> Spark. Since word2vec is largely a pre-processing step, the performance often 
> depends on the application it is being used for and the corpus it is 
> estimated on. These implementations give users the choice of picking the one 
> that works best for their use case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20903:


Assignee: Apache Spark

> Word2Vec Skip-Gram + Negative Sampling
> --
>
> Key: SPARK-20903
> URL: https://issues.apache.org/jira/browse/SPARK-20903
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Shubham Chopra
>Assignee: Apache Spark
>
> Skip-Gram + Negative Sampling has been shown to be comparable to or to 
> outperform the hierarchical-softmax-based approach currently implemented in 
> Spark. Since word2vec is largely a pre-processing step, the performance often 
> depends on the application it is being used for and the corpus it is 
> estimated on. These implementations give users the choice of picking the one 
> that works best for their use case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026663#comment-16026663
 ] 

Michael Allman commented on SPARK-20843:


bq. I will say that I don't think it's really safe to rely on clean shutdowns 
for correctness...

I agree, but we don't want a "force kill" to become the norm for an app 
shutdown. An unclean shutdown should be an exceptional situation that will be 
cause for an ops alert, an investigation, and a validation that no data loss or 
corruption occurred (i.e. our failure mechanisms held).

Also, I would say that in some cases where an app integrates with other systems 
and cannot guarantee transactional or idempotent semantics in the event of a 
failure, an unclean shutdown *will* require cross-system validation and any 
necessary data synchronization or recovery.

Cheers.

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VMs 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shut down gracefully 
> from doing so. For example, streaming apps with a large batch duration (say, 
> 30 seconds or more) can take minutes to shut down.
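
A minimal sketch of the kind of configurable grace period being asked for here, assuming a plain java.lang.Process handle; the property name spark.driver.killGracePeriod is made up for illustration and is not an existing Spark config:

{code}
import java.util.concurrent.TimeUnit

// Hypothetical: wait up to a configurable grace period before force-killing,
// instead of the hard-coded 10 seconds described above.
def stopDriver(process: Process, gracePeriodSec: Long): Unit = {
  process.destroy()                                         // ask for a clean shutdown first
  val exited = process.waitFor(gracePeriodSec, TimeUnit.SECONDS)
  if (!exited) {
    // Only escalate if the driver did not exit within the grace period.
    process.destroyForcibly()
  }
}

// e.g. gracePeriodSec read from a config such as "spark.driver.killGracePeriod" (hypothetical)
{code}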



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling

2017-05-26 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-20903:
--

 Summary: Word2Vec Skip-Gram + Negative Sampling
 Key: SPARK-20903
 URL: https://issues.apache.org/jira/browse/SPARK-20903
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Affects Versions: 2.1.1
Reporter: Shubham Chopra


Skip-Gram + Negative Sampling has been shown to be comparable to or to 
outperform the hierarchical-softmax-based approach currently implemented in 
Spark. Since word2vec is largely a pre-processing step, the performance often 
depends on the application it is being used for and the corpus it is estimated 
on. These implementations give users the choice of picking the one that works 
best for their use case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20372) Word2Vec Continuous Bag Of Words model

2017-05-26 Thread Shubham Chopra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chopra updated SPARK-20372:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-20902

> Word2Vec Continuous Bag Of Words model
> --
>
> Key: SPARK-20372
> URL: https://issues.apache.org/jira/browse/SPARK-20372
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Shubham Chopra
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20902) Word2Vec implementations with Negative Sampling

2017-05-26 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-20902:
--

 Summary: Word2Vec implementations with Negative Sampling
 Key: SPARK-20902
 URL: https://issues.apache.org/jira/browse/SPARK-20902
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.1
Reporter: Shubham Chopra


Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical softmax. 
Both Continuous bag of words (CBOW) and SkipGram have shown comparative or 
better performance with Negative Sampling. This umbrella JIRA is to keep a 
track of the effort to add negative sampling based implementations of both CBOW 
and SkipGram models to Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026643#comment-16026643
 ] 

Marcelo Vanzin commented on SPARK-20843:


I mostly followed [~zsxwing]'s review since he didn't seem to have issues with 
it (and he filed the original bug, SPARK-13602). My comments were mostly 
stylistic.

Given that it may cause problems, it might be safer to revert this in 2.2 and 
fix it in master. Maybe [~bryanc] can chime in too, since he worked on the 
change itself.

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VMs 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shut down gracefully 
> from doing so. For example, streaming apps with a large batch duration (say, 
> 30 seconds or more) can take minutes to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026634#comment-16026634
 ] 

Michael Armbrust commented on SPARK-20843:
--

I don't have much context here /cc [~zsxwing] and [~vanzin]. It does seem like 
this could at least be configurable.

I will say that I don't think it's really safe to rely on clean shutdowns for 
correctness, and I don't think this change affects structured streaming.



> Cannot gracefully kill drivers which take longer than 10 seconds to die
> ---
>
> Key: SPARK-20843
> URL: https://issues.apache.org/jira/browse/SPARK-20843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Allman
>  Labels: regression
>
> Commit 
> https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657
>  changed the behavior of driver process termination. Whereas before 
> `Process.destroyForcibly` was never called, now it is called (on Java VMs 
> supporting that API) if the driver process does not die within 10 seconds.
> This prevents apps which take longer than 10 seconds to shut down gracefully 
> from doing so. For example, streaming apps with a large batch duration (say, 
> 30 seconds or more) can take minutes to shut down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities

2017-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20393:
--
Fix Version/s: (was: 2.3.0)
   2.2.0

> Strengthen Spark to prevent XSS vulnerabilities
> ---
>
> Key: SPARK-20393
> URL: https://issues.apache.org/jira/browse/SPARK-20393
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 2.0.2, 2.1.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>Priority: Minor
>  Labels: security
> Fix For: 2.2.0
>
>
> Using IBM Security AppScan Standard, we discovered several easy-to-recreate 
> MHTML cross-site scripting vulnerabilities in the Apache Spark Web GUI 
> application; these vulnerabilities were found to exist in Spark versions 
> 1.5.2 and 2.0.2, the two levels we initially tested. A cross-site scripting 
> attack is not really an attack on the Spark server as much as an attack on 
> the end user, taking advantage of their trust in the Spark server to get them 
> to click on a URL like the ones in the examples below. So whether the user 
> could or could not change lots of stuff on the Spark server is not the key 
> point; it is an attack on the users themselves. If they click the link, the 
> script could run in their browser and compromise their device. Once the 
> browser is compromised, it could submit Spark requests, but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/
> {quote}
> Request: GET 
> /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
> _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> HTTP/1.1
> Excerpt from response: No running application with ID 
> Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> 
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET 
> /history/app-20161012202114-0038/stages/stage?id=1=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> k.pageSize=100 HTTP/1.1
> Excerpt from response: Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET /log?appId=app-20170113131903-=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> eLength=0 HTTP/1.1
> Excerpt from response:  Bytes 0-0 of 0 of 
> /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content-
> Type: multipart/related; boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> {quote}
> security@apache was notified and recommended a PR.
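
For reference, a quick way to check what the injected payload decodes to, using only the standard JDK (nothing Spark-specific):

{code}
import java.util.Base64
import java.nio.charset.StandardCharsets

// The BASE64 blob from the requests above
val payload = "PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+"
val decoded = new String(Base64.getDecoder.decode(payload), StandardCharsets.UTF_8)

// prints: <html><script>alert("XSS")</script></html>
println(decoded)
{code}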



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator

2017-05-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026595#comment-16026595
 ] 

Shixiong Zhu commented on SPARK-20882:
--

[~cenyuhai] do you mind sharing what the root cause was?

> Executor is waiting for ShuffleBlockFetcherIterator
> ---
>
> Key: SPARK-20882
> URL: https://issues.apache.org/jira/browse/SPARK-20882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: cen yuhai
> Attachments: executor_jstack, executor_log, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> This bug is like https://issues.apache.org/jira/browse/SPARK-19300,
> but I have updated my client netty version to 4.0.43.Final.
> The shuffle service handler is still on 4.0.42.Final, and
> spark.sql.adaptive.enabled is true.
> {code}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> parking to wait for <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> {code}
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, 
> shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, 
> shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 21
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 20
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 19
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 18
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 17
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 16
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 15
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 14
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 13
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 12
> 17/05/26 12:04:06 DEBUG 

[jira] [Commented] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026588#comment-16026588
 ] 

Michael Armbrust commented on SPARK-20897:
--

Is this a regression? If so, can you please make sure that it's targeted at the 
2.2.0 release?

> cached self-join should not fail
> 
>
> Key: SPARK-20897
> URL: https://issues.apache.org/jira/browse/SPARK-20897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> code to reproduce this bug:
> {code}
> // force to plan sort merge join
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df = Seq(1 -> "a").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20901) Feature parity for ORC with Parquet

2017-05-26 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-20901:
-

 Summary: Feature parity for ORC with Parquet
 Key: SPARK-20901
 URL: https://issues.apache.org/jira/browse/SPARK-20901
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Dongjoon Hyun


This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11412) Support merge schema for ORC

2017-05-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-11412:
--
Affects Version/s: 2.1.1

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dave
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say 
> column 
> -- request: struct (nullable = true)
>  ||-- datetime: string (nullable = true)
>  ||-- host: string (nullable = true)
>  ||-- ip: string (nullable = true)
>  ||-- referer: string (nullable = true)
>  ||-- request_uri: string (nullable = true)
>  ||-- uri: string (nullable = true)
>  ||-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for ORC?
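
For comparison, schema merging already works for Parquet; a minimal sketch with a placeholder path:

{code}
// Parquet supports schema merging across partitions today; the request here is
// the same behavior for ORC. The path below is a placeholder.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/warehouse_parquet/")

merged.printSchema()
{code}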



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set

2017-05-26 Thread Alexander Bessonov (JIRA)
Alexander Bessonov created SPARK-20900:
--

 Summary: ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is 
not set
 Key: SPARK-20900
 URL: https://issues.apache.org/jira/browse/SPARK-20900
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.0, 1.6.0, 1.2.0
 Environment: Spark 2.1.0
Reporter: Alexander Bessonov
Priority: Minor


When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is 
not set or is set to an empty string, {{org.apache.hadoop.fs.Path}} will throw 
{{IllegalArgumentException}} instead of returning {{null}}. This is not handled, 
and the exception crashes the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19809) NullPointerException on empty ORC file

2017-05-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-19809:
---

IMO, we had better be more robust about this. Third-party tools (Pig and Sqoop 
have been reported) sometimes introduce these issues. 
{code}
scala> sql("create table empty_orc(a int) stored as orc location 
'/tmp/empty_orc'").show
++
||
++
++

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
{code}

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at 

[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file

2017-05-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026515#comment-16026515
 ] 

Dongjoon Hyun edited comment on SPARK-19809 at 5/26/17 5:09 PM:


IMO, we had better be more robust about this. Third-party tools (Pig and Sqoop 
have been reported) sometimes introduce these issues. 
{code}
scala> sql("create table empty_orc(a int) stored as orc location 
'/tmp/empty_orc'").show
++
||
++
++

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem
  at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
  at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
{code}


was (Author: dongjoon):
IMO, we had better be more robust about this. Third-party tools (Pig and Sqoop 
have been reported) sometimes introduce these issues. 
{code}
scala> sql("create table empty_orc(a int) stored as orc location 
'/tmp/empty_orc'").show
++
||
++
++

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
{code}

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> 

[jira] [Updated] (SPARK-19809) NullPointerException on empty ORC file

2017-05-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19809:
--
Affects Version/s: 2.1.1

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0-byte files we get a 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>

[jira] [Assigned] (SPARK-20835) It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application

2017-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20835:
-

  Assignee: eaton
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> It should exit directly when the --total-executor-cores parameter is set to 
> less than 0 when submitting an application
> -
>
> Key: SPARK-20835
> URL: https://issues.apache.org/jira/browse/SPARK-20835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
>Assignee: eaton
>Priority: Minor
> Fix For: 2.3.0
>
>
> In my test, the submitted app ran without an error when 
> --total-executor-cores was less than 0
> and gave the warning:
> "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources";
> It should exit directly when the --total-executor-cores parameter is set to 
> less than 0 when submitting an application.
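
A minimal sketch of the kind of eager validation being asked for, done at argument-parsing time; the method name and error handling below are illustrative, not the actual SparkSubmit code:

{code}
// Hypothetical check: fail fast on a non-positive --total-executor-cores
// instead of letting the app hang while it waits for resources.
def validateTotalExecutorCores(totalExecutorCores: Option[String]): Unit = {
  totalExecutorCores.foreach { value =>
    val cores = try value.toInt catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"--total-executor-cores must be an integer, got: $value")
    }
    if (cores <= 0) {
      throw new IllegalArgumentException(
        s"--total-executor-cores must be a positive number, got: $cores")
    }
  }
}
{code}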



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20835) It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application

2017-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20835.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18060
[https://github.com/apache/spark/pull/18060]

> It should exit directly when the --total-executor-cores parameter is set to 
> less than 0 when submitting an application
> -
>
> Key: SPARK-20835
> URL: https://issues.apache.org/jira/browse/SPARK-20835
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
> Fix For: 2.3.0
>
>
> In my test, the submitted app ran without an error when 
> --total-executor-cores was less than 0
> and gave the warning:
> "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources";
> It should exit directly when the --total-executor-cores parameter is set to 
> less than 0 when submitting an application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator

2017-05-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai closed SPARK-20882.
-
Resolution: Not A Problem

> Executor is waiting for ShuffleBlockFetcherIterator
> ---
>
> Key: SPARK-20882
> URL: https://issues.apache.org/jira/browse/SPARK-20882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: cen yuhai
> Attachments: executor_jstack, executor_log, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> This bug is like https://issues.apache.org/jira/browse/SPARK-19300,
> but I have updated my client netty version to 4.0.43.Final.
> The shuffle service handler is still on 4.0.42.Final, and
> spark.sql.adaptive.enabled is true.
> {code}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> parking to wait for <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> {code}
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, 
> shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, 
> shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 21
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 20
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 19
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 18
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 17
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 16
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 15
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 14
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 13
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 12
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 11
> 17/05/26 

[jira] [Assigned] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20899:


Assignee: Apache Spark

> PySpark supports stringIndexerOrderType in RFormula
> ---
>
> Key: SPARK-20899
> URL: https://issues.apache.org/jira/browse/SPARK-20899
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20899:


Assignee: (was: Apache Spark)

> PySpark supports stringIndexerOrderType in RFormula
> ---
>
> Key: SPARK-20899
> URL: https://issues.apache.org/jira/browse/SPARK-20899
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026489#comment-16026489
 ] 

Apache Spark commented on SPARK-20899:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18122

> PySpark supports stringIndexerOrderType in RFormula
> ---
>
> Key: SPARK-20899
> URL: https://issues.apache.org/jira/browse/SPARK-20899
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-26 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20899:
---

 Summary: PySpark supports stringIndexerOrderType in RFormula
 Key: SPARK-20899
 URL: https://issues.apache.org/jira/browse/SPARK-20899
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.1.1
Reporter: Wayne Zhang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN

2017-05-26 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20898:
-

 Summary: spark.blacklist.killBlacklistedExecutors doesn't work in 
YARN
 Key: SPARK-20898
 URL: https://issues.apache.org/jira/browse/SPARK-20898
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Thomas Graves


I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN, but 
it doesn't appear to work. Every time I get:

17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted 
executor id 4 since allocation client is not defined

Even though dynamic allocation is on. Taking a quick look, I think the way it 
creates the BlacklistTracker and passes the allocation client is wrong. The 
scheduler backend is not set yet, so it never passes the allocation client to 
the BlacklistTracker correctly. Thus it will never kill.
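
A simplified, generic sketch of the initialization-order pitfall described above; the class and names are illustrative, not the actual Spark internals:

{code}
// Illustrative only: if the tracker captures the allocation client at
// construction time, it keeps seeing None even after the scheduler backend
// is created later. Paste into spark-shell or the Scala REPL.
class BlacklistTrackerSketch(allocationClient: Option[String]) {
  def killExecutor(id: String): String = allocationClient match {
    case Some(client) => s"asking $client to kill executor $id"
    case None         => s"Not attempting to kill blacklisted executor id $id since allocation client is not defined"
  }
}

var client: Option[String] = None                 // scheduler backend not set yet
val tracker = new BlacklistTrackerSketch(client)  // client captured too early
client = Some("yarn-allocation-client")           // set later, but the tracker never sees it

println(tracker.killExecutor("4"))                // still reports "not defined" -- the reported pattern
{code}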



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20897:


Assignee: Wenchen Fan  (was: Apache Spark)

> cached self-join should not fail
> 
>
> Key: SPARK-20897
> URL: https://issues.apache.org/jira/browse/SPARK-20897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> code to reproduce this bug:
> {code}
> // force to plan sort merge join
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df = Seq(1 -> "a").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20897:


Assignee: Apache Spark  (was: Wenchen Fan)

> cached self-join should not fail
> 
>
> Key: SPARK-20897
> URL: https://issues.apache.org/jira/browse/SPARK-20897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> code to reproduce this bug:
> {code}
> // force to plan sort merge join
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df = Seq(1 -> "a").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026392#comment-16026392
 ] 

Apache Spark commented on SPARK-20897:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18121

> cached self-join should not fail
> 
>
> Key: SPARK-20897
> URL: https://issues.apache.org/jira/browse/SPARK-20897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> code to reproduce this bug:
> {code}
> // force to plan sort merge join
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df = Seq(1 -> "a").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20897:
---

 Summary: cached self-join should not fail
 Key: SPARK-20897
 URL: https://issues.apache.org/jira/browse/SPARK-20897
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan


code to reproduce this bug:
{code}
// force to plan sort merge join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
val df = Seq(1 -> "a").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20775) from_json should also have an API where the schema is specified with a string

2017-05-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20775:
---

Assignee: Ruben Janssen

> from_json should also have an API where the schema is specified with a string
> -
>
> Key: SPARK-20775
> URL: https://issues.apache.org/jira/browse/SPARK-20775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Ruben Janssen
> Fix For: 2.3.0
>
>
> Right now you also have to provide a java.util.Map, which is not nice for 
> Scala users.
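
A minimal sketch of the difference, runnable in spark-shell; the string-schema overload at the end is shown only as the proposed shape and is an assumption, not an existing API here:

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

// Existing API: build a StructType (the Java-friendly overload also wants a java.util.Map of options)
val schema = new StructType().add("a", IntegerType).add("b", StringType)
df.select(from_json($"json", schema)).show()

// Proposed shape (assumption): specify the schema as a DDL-formatted string instead
// df.select(from_json($"json", "a INT, b STRING", new java.util.HashMap[String, String]())).show()
{code}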



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20887) support alternative keys in ConfigBuilder

2017-05-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20887.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18110
[https://github.com/apache/spark/pull/18110]

> support alternative keys in ConfigBuilder
> -
>
> Key: SPARK-20887
> URL: https://issues.apache.org/jira/browse/SPARK-20887
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19669) Open up visibility for sharedState, sessionState, and a few other functions

2017-05-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026219#comment-16026219
 ] 

Steve Loughran commented on SPARK-19669:


Thanks for this; it's very nice to have Logging usable outside Spark's own 
codebase again. I had my own copy & paste, but when you start playing with 
traits across things, life gets hard.

> Open up visibility for sharedState, sessionState, and a few other functions
> ---
>
> Key: SPARK-19669
> URL: https://issues.apache.org/jira/browse/SPARK-19669
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> To ease debugging, most of Spark SQL internals have public level visibility. 
> Two of the most important internal states, sharedState and sessionState, 
> however, are package private. It would make more sense to open these up as 
> well with clear documentation that they are internal.
> In addition, users currently have a way to set the active/default SparkSession, but 
> no way to actually get them back. We should open those up as well.
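A minimal sketch of what this opens up, assuming the Spark 2.2 APIs described above (illustrative, not part of the issue text):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// Internal states, now publicly visible (documented as internal/unstable).
val shared  = spark.sharedState
val session = spark.sessionState

// Getters paired with the existing setActiveSession/setDefaultSession.
val active:  Option[SparkSession] = SparkSession.getActiveSession
val default: Option[SparkSession] = SparkSession.getDefaultSession
{code}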



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-26 Thread pralabhkumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar updated SPARK-20199:
-
Description: 
Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses 
random forest internally, which has featureSubsetStrategy hardcoded to "all". It 
should be configurable by the user to allow randomness at the feature level.




This parameter is available in H2O and XGBoost. 

Sample from H2O.ai 
gbmParams._col_sample_rate

Please provide the parameter.
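For illustration only (not part of the report): in the MLlib API, RandomForest already exposes {{featureSubsetStrategy}}, while the GBT entry point only takes a {{BoostingStrategy}}, which has no equivalent knob; the request is to surface one. The sketch below assumes a spark-shell session with an existing {{sc}}.
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.{GradientBoostedTrees, RandomForest}
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Tiny toy dataset, just to make the calls runnable.
val trainingData = sc.parallelize((1 to 50).map(i =>
  LabeledPoint((i % 2).toDouble, Vectors.dense(i.toDouble, (i * 2).toDouble))))

// Random forests let the caller pick the per-node feature subset.
val rfModel = RandomForest.trainRegressor(trainingData,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "sqrt", impurity = "variance",
  maxDepth = 3, maxBins = 32)

// GBTs only take a BoostingStrategy; internally they run a random forest with
// featureSubsetStrategy fixed to "all", which is what this issue asks to make
// configurable.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 10
val gbtModel = GradientBoostedTrees.train(trainingData, boostingStrategy)
{code}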

  was:
Spark GradientBoostedTreesModel doesn't have Column  sampling rate parameter . 
This parameter is available in H2O and XGBoost. 

Sample from H2O.ai 
gbmParams._col_sample_rate

Please provide the parameter . 


> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be configurable by the user to allow randomness at the feature level.
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-20896:
-
Description: 
1. Zeppelin 0.6.2 in *SCOPE* mode
2. Spark 1.6.2
3. HDP 2.4 for HDFS and YARN

Trigger Scala code like:
{quote}
var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
val vectorDf = assembler.transform(tmpDataFrame)
val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")

val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
val rows = columns.toSeq.transpose
val vectors = rows.map(row => new DenseVector(row.toArray))
val vRdd = sc.parallelize(vectors)

import sqlContext.implicits._
val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()

val rows = dfV.rdd.zipWithIndex.map(_.swap)
  
.join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
  .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
:+ x)}
{quote}
---
and code :
{quote}
var df = sql("select b1,b2 from .x")
var i = 0
var threshold = Array(2.0,3.0)
var inputCols = Array("b1","b2")

var tmpDataFrame = df
for (col <- inputCols){
  val binarizer: Binarizer = new Binarizer().setInputCol(col)
.setOutputCol(inputCols(i)+"_binary")
.setThreshold(threshold(i))
  tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
  i = i+1
}
var saveDFBin = tmpDataFrame

val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
  .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
  .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ 
row2.toSeq)}

import org.apache.spark.sql.types.StructType
val rowSchema = StructType(saveDFBin.schema.fields ++ 
dfAppendBin.schema.fields)
saveDFBin = sqlContext.createDataFrame(rows, rowSchema)


//save result to table
import org.apache.spark.sql.SaveMode
saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
sql("alter table . set lifecycle 1")
{quote}

Run on Zeppelin in two different notebooks at the same time.

Found this exception log in the executor:
{quote}
l1.dtdream.com): java.lang.ClassCastException: 
org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2
at 
$line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
at 
org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at 
org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{quote}

OR 
{quote}
java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
org.apache.spark.mllib.linalg.DenseVector
at 
$line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{quote}

Some log from the executor:
{quote}
17/05/26 16:39:44 INFO executor.Executor: Running task 3.1 in stage 36.0 (TID 
598)
17/05/26 16:39:44 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 30
17/05/26 16:39:44 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as 

[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-20896:
-
Description: 
1. Zeppelin 0.6.2
2. Spark 1.6.2
3. HDP 2.4 for HDFS and YARN

Trigger Scala code like:
{quote}
var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
val vectorDf = assembler.transform(tmpDataFrame)
val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")

val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
val rows = columns.toSeq.transpose
val vectors = rows.map(row => new DenseVector(row.toArray))
val vRdd = sc.parallelize(vectors)

import sqlContext.implicits._
val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()

val rows = dfV.rdd.zipWithIndex.map(_.swap)
  
.join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
  .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
:+ x)}
{quote}
---
and code :
{quote}
var df = sql("select b1,b2 from .x")
var i = 0
var threshold = Array(2.0,3.0)
var inputCols = Array("b1","b2")

var tmpDataFrame = df
for (col <- inputCols){
  val binarizer: Binarizer = new Binarizer().setInputCol(col)
.setOutputCol(inputCols(i)+"_binary")
.setThreshold(threshold(i))
  tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
  i = i+1
}
var saveDFBin = tmpDataFrame

val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
  .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
  .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ 
row2.toSeq)}

import org.apache.spark.sql.types.StructType
val rowSchema = StructType(saveDFBin.schema.fields ++ 
dfAppendBin.schema.fields)
saveDFBin = sqlContext.createDataFrame(rows, rowSchema)


//save result to table
import org.apache.spark.sql.SaveMode
saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
sql("alter table . set lifecycle 1")
{quote}

Run on Zeppelin in two different notebooks at the same time.

Found this exception log in the executor:
{quote}
l1.dtdream.com): java.lang.ClassCastException: 
org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2
at 
$line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
at 
org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at 
org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{quote}

OR 
{quote}
java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
org.apache.spark.mllib.linalg.DenseVector
at 
$line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{quote}

These two exceptions never show up in pairs.




  was:
1、zeppelin 0.6.2 
2、spark 1.6.2 
3、hdp 2.4 for HDFS YARN 

trigger scala code like :
{quote}
var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
val vectorDf = assembler.transform(tmpDataFrame)
val vectRdd = vectorDf.select("features").map{x:Row => 

[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-20896:
-
Description: 
1. Zeppelin 0.6.2
2. Spark 1.6.2
3. HDP 2.4 for HDFS and YARN

Trigger Scala code like:
{quote}
var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
val vectorDf = assembler.transform(tmpDataFrame)
val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")

val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
val rows = columns.toSeq.transpose
val vectors = rows.map(row => new DenseVector(row.toArray))
val vRdd = sc.parallelize(vectors)

import sqlContext.implicits._
val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()

val rows = dfV.rdd.zipWithIndex.map(_.swap)
  
.join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
  .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
:+ x)}
{quote}
---
and code :
{quote}
var df = sql("select b1,b2 from .x")
var i = 0
var threshold = Array(2.0,3.0)
var inputCols = Array("b1","b2")

var tmpDataFrame = df
for (col <- inputCols){
  val binarizer: Binarizer = new Binarizer().setInputCol(col)
.setOutputCol(inputCols(i)+"_binary")
.setThreshold(threshold(i))
  tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
  i = i+1
}
var saveDFBin = tmpDataFrame

val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
  .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
  .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ 
row2.toSeq)}

import org.apache.spark.sql.types.StructType
val rowSchema = StructType(saveDFBin.schema.fields ++ 
dfAppendBin.schema.fields)
saveDFBin = sqlContext.createDataFrame(rows, rowSchema)


//save result to table
import org.apache.spark.sql.SaveMode
saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
sql("alter table . set lifecycle 1")
{quote}

at the same time 

  was:
1、zeppelin 0.6.2 
2、spark 1.6.2 
3、hdp 2.4 for HDFS YARN 

trigger scala code like 



> spark executor get java.lang.ClassCastException when trigger two job at same 
> time
> -
>
> Key: SPARK-20896
> URL: https://issues.apache.org/jira/browse/SPARK-20896
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: poseidon
>
> 1. Zeppelin 0.6.2
> 2. Spark 1.6.2
> 3. HDP 2.4 for HDFS and YARN
> Trigger Scala code like:
> {quote}
> var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
> val vectorDf = assembler.transform(tmpDataFrame)
> val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
> val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")
> val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
> val rows = columns.toSeq.transpose
> val vectors = rows.map(row => new DenseVector(row.toArray))
> val vRdd = sc.parallelize(vectors)
> import sqlContext.implicits._
> val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()
> val rows = dfV.rdd.zipWithIndex.map(_.swap)
>   
> .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
>   .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
> :+ x)}
> {quote}
> ---
> and code :
> {quote}
> var df = sql("select b1,b2 from .x")
> var i = 0
> var threshold = Array(2.0,3.0)
> var inputCols = Array("b1","b2")
> var tmpDataFrame = df
> for (col <- inputCols){
>   val binarizer: Binarizer = new Binarizer().setInputCol(col)
> .setOutputCol(inputCols(i)+"_binary")
> .setThreshold(threshold(i))
>   tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
>   i = i+1
> }
> var saveDFBin = tmpDataFrame
> val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
> val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
>   .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
>   .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq 
> ++ row2.toSeq)}
> import org.apache.spark.sql.types.StructType
> val rowSchema = StructType(saveDFBin.schema.fields ++ 
> dfAppendBin.schema.fields)
> saveDFBin = sqlContext.createDataFrame(rows, rowSchema)
> //save result to table
> import org.apache.spark.sql.SaveMode
> saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
> sql("alter table . set lifecycle 1")
> {quote}
> at the same time 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-20896:
-
Attachment: (was: token_err.log)

> spark executor get java.lang.ClassCastException when trigger two job at same 
> time
> -
>
> Key: SPARK-20896
> URL: https://issues.apache.org/jira/browse/SPARK-20896
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: poseidon
>
> 1. Zeppelin 0.6.2
> 2. Spark 1.6.2
> 3. HDP 2.4 for HDFS and YARN
> Trigger Scala code like 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-20896:
-
Attachment: token_err.log

> spark executor get java.lang.ClassCastException when trigger two job at same 
> time
> -
>
> Key: SPARK-20896
> URL: https://issues.apache.org/jira/browse/SPARK-20896
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: poseidon
>
> 1. Zeppelin 0.6.2
> 2. Spark 1.6.2
> 3. HDP 2.4 for HDFS and YARN
> Trigger Scala code like 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-05-26 Thread poseidon (JIRA)
poseidon created SPARK-20896:


 Summary: spark executor get java.lang.ClassCastException when 
trigger two job at same time
 Key: SPARK-20896
 URL: https://issues.apache.org/jira/browse/SPARK-20896
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
Reporter: poseidon


1. Zeppelin 0.6.2
2. Spark 1.6.2
3. HDP 2.4 for HDFS and YARN

Trigger Scala code like 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator

2017-05-26 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026040#comment-16026040
 ] 

cen yuhai edited comment on SPARK-20882 at 5/26/17 9:13 AM:


[~zsxwing] yes.
{code}
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
{code}
Before that, the number of requests in flight was 1 and remainingBlocks was already empty:
{code}
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_809)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
{code}


was (Author: cenyuhai):
yes.
{code}
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
{code}
before the requests in flight 1. the remainingBlocks is empty.
{code}
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_809)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
{code}

> Executor is waiting for ShuffleBlockFetcherIterator
> ---
>
> Key: SPARK-20882
> URL: https://issues.apache.org/jira/browse/SPARK-20882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: cen yuhai
> Attachments: executor_jstack, executor_log, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> This bug is like https://issues.apache.org/jira/browse/SPARK-19300,
> but I have updated my client Netty version to 4.0.43.Final.
> The shuffle service handler is still on 4.0.42.Final.
> spark.sql.adaptive.enabled is true.
> {code}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> parking to wait for <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> {code}
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, 
> shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, 
> shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809)
> 17/05/26 

[jira] [Commented] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator

2017-05-26 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026040#comment-16026040
 ] 

cen yuhai commented on SPARK-20882:
---

yes.
{code}
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests 
outstanding when connection from 10.0.139.110:7337 is closed
{code}
Before that, the number of requests in flight was 1 and remainingBlocks was already empty:
{code}
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_809)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
{code}
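For reference (not from the reporter): the DEBUG lines quoted here come from {{org.apache.spark.storage.ShuffleBlockFetcherIterator}}. One way to obtain them, sketched under the assumption of the log4j 1.x bundled with Spark 2.1, is to raise that logger's level programmatically:
{code}
// Illustrative only: enable the ShuffleBlockFetcherIterator DEBUG output
// quoted in this comment without turning on DEBUG globally.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.storage.ShuffleBlockFetcherIterator")
  .setLevel(Level.DEBUG)
{code}
The same effect is usually achieved by adding the corresponding logger line to conf/log4j.properties so that executors pick it up as well.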

> Executor is waiting for ShuffleBlockFetcherIterator
> ---
>
> Key: SPARK-20882
> URL: https://issues.apache.org/jira/browse/SPARK-20882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: cen yuhai
> Attachments: executor_jstack, executor_log, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png
>
>
> This bug is like https://issues.apache.org/jira/browse/SPARK-19300,
> but I have updated my client Netty version to 4.0.43.Final.
> The shuffle service handler is still on 4.0.42.Final.
> spark.sql.adaptive.enabled is true.
> {code}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> parking to wait for <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> {code}
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, 
> shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, 
> shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_808, shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
> Set(shuffle_5_1431_809)
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 21
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 20
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 19
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 18
> 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
> flight 17
> 

[jira] [Updated] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator

2017-05-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20882:
--
Description: 
This bug is like https://issues.apache.org/jira/browse/SPARK-19300,
but I have updated my client Netty version to 4.0.43.Final.
The shuffle service handler is still on 4.0.42.Final.
spark.sql.adaptive.enabled is true.
{code}
"Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
parking to wait for <0x000498c249c0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.scheduler.Task.run(Task.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:834)
{code}


{code}
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, 
shuffle_5_1431_809, shuffle_5_1431_807)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, 
shuffle_5_1431_807)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_808, shuffle_5_1431_809)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: 
Set(shuffle_5_1431_809)
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set()
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 21
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 20
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 19
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 18
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 17
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 16
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 15
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 14
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 13
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 12
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 11
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 10
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 9
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 8
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 7
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 6
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 5
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 
flight 4
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in 

[jira] [Issue Comment Deleted] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-26 Thread pralabhkumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar updated SPARK-20199:
-
Comment: was deleted

(was: Shouldn't there be a parameter in GBMParameters() 

val gbmParams = new GBMParameters()
gbmParams._featuresubsetStrategy="auto"

and then in 

private[ml] def train(data: RDD[LabeledPoint],
  oldStrategy: OldStrategy): DecisionTreeRegressionModel = {

we should have
oldStrategy.getStrategy() and set it in

val trees = RandomForest.run(data, oldStrategy, numTrees = 1, 
featureSubsetStrategy = oldStrategy.getStrategy(), seed = $(seed), instr = 
Some(instr), parentUID = Some(uid)))

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate 
> parameter. This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-05-26 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015345#comment-16015345
 ] 

Yan Facai (颜发才) edited comment on SPARK-19581 at 5/26/17 9:02 AM:
--

[~barrybecker4] Hi, Becker.
I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7.

1) For a feature vector of size 0, the exception is harmless.

{code}
  val data = 
spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache
  import org.apache.spark.ml.classification.NaiveBayes
  val model = new NaiveBayes().fit(data)
  import org.apache.spark.ml.linalg.{Vectors => SV}
  case class TestData(features: org.apache.spark.ml.linalg.Vector)
  val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double])
  val test = Seq(TestData(emptyVector)).toDF
scala>  test.show
+-+
| features|
+-+
|(0,[],[])|
+-+

scala> model.transform(test).show
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$1: (vector) => vector)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
  ... 48 elided
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns 
of A don't match the number of elements of x. A: 692, x: 0
  at scala.Predef$.require(Predef.scala:224)
  ... 99 more
{code}

2) For an empty feature vector of size 692, it's OK.

{code}
scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double])
emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[])

scala> val test = Seq(TestData(emptyVector)).toDF
test: org.apache.spark.sql.DataFrame = [features: vector]

scala> test.show
+---+
|   features|
+---+
|(692,[],[])|
+---+

scala> model.transform(test).show
+---+++--+
|   features|   rawPrediction| probability|prediction|
+---+++--+
|(692,[],[])|[-0.8407831793660...|[0.43137254901960...|   1.0|
+---+++--+
{code}


was (Author: facai):
[~barrybecker4] Hi, Becker.
I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7.

1) For 0 size of feature, the exception is harmless.

```scala
  val data = 
spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache
  import org.apache.spark.ml.classification.NaiveBayes
  val model = new NaiveBayes().fit(data)
  import org.apache.spark.ml.linalg.{Vectors => SV}
  case class TestData(features: org.apache.spark.ml.linalg.Vector)
  val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double])
  val test = Seq(TestData(emptyVector)).toDF
scala>  test.show
+-+
| features|
+-+
|(0,[],[])|
+-+

scala> model.transform(test).show
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$1: (vector) => vector)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
  ... 48 elided
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns 
of A don't match the number of elements of x. A: 692, x: 0
  at scala.Predef$.require(Predef.scala:224)
  ... 99 more
```

2) For 692 size of empty feature, it's OK.

```scala
scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double])
emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[])

scala> val test = Seq(TestData(emptyVector)).toDF
test: org.apache.spark.sql.DataFrame = [features: vector]

scala> test.show
+---+
|   features|
+---+
|(692,[],[])|
+---+

scala> model.transform(test).show
+---+++--+
|   features|   rawPrediction| probability|prediction|
+---+++--+
|(692,[],[])|[-0.8407831793660...|[0.43137254901960...|   1.0|
+---+++--+

```

> running NaiveBayes model with 0 features can crash the executor with D 
> rorreGEMV
> 
>
> Key: SPARK-19581
> URL: https://issues.apache.org/jira/browse/SPARK-19581
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark development or standalone mode on windows or linux.
>Reporter: Barry Becker
>Priority: Minor
>
> The severity of this bug is high (because nothing should cause spark to crash 
> like this) but the priority may be low (because there is an easy workaround).
> In our application, a user can select features and a target to run the 
> NaiveBayes inducer. If columns have too many values or all one value, they 
> will be removed before we call the inducer to create the model. As a result, 
> there are 

[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders

2017-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026028#comment-16026028
 ] 

Reynold Xin commented on SPARK-20895:
-

Have you seen a specific case in practice in which the optimizer is the 
bottleneck? Maybe we should improve optimizer speed and add these in the future 
when needed ... it's a lot of complexity.


> Support fast execution based on an optimized plan and parameter placeholders
> 
>
> Key: SPARK-20895
> URL: https://issues.apache.org/jira/browse/SPARK-20895
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In database scenarios, users sometimes use parameterized queries for repeated 
> execution (e.g., by using prepared statements).
> So, I think this functionality is also useful for Spark users.
> What I suggest here looks like the following:
> My prototype here: 
> https://github.com/apache/spark/compare/master...maropu:PreparedStmt2
> {code}
> scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t")
> // Define a query with a parameter placeholder named `val`
> scala> val df = sql("SELECT * FROM t WHERE col1 = $val")
> scala> df.explain
> == Physical Plan ==
> *Project [_1#13 AS col1#16, _2#14 AS col2#17]
> +- *Filter (_1#13 = cast(parameterholder(val) as int))
>+- LocalTableScan [_1#13, _2#14]
> // Apply optimizer rules and get an optimized logical plan with the parameter 
> placeholder
> scala> val preparedDf = df.prepared
> // Bind an actual value and do execution
> scala> preparedDf.bindParam("val", 1).show()
> +++
> |col1|col2|
> +++
> |   1|   2|
> +++
> {code}
> To implement this, my prototype adds a new expression leaf node named 
> `ParameterHolder`.
> In a binding phase, this node is replaced with `Literal` including an actual 
> value by using `bindParam`.
> Currently, Spark sometimes spends a lot of time rewriting logical plans in 
> `Optimizer` (e.g. the constant propagation described in SPARK-19846).
> So, I feel this approach is also helpful in that case:
> {code}
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
> val t0 = System.nanoTime()
> f
> val t1 = System.nanoTime()
> val elapsed = t1 - t0 + 0.0
> println(s"#$i: ${elapsed / 10.0}")
> elapsed
>   }
>   println("Avg. Elapsed Time: " + ((iters.sum / count) / 10.0) + "s")
> }
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val numCols = 50
> val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS 
> _c$i"): _*)
> // Add conditions to take much time in Optimizer
> val filter = (0 until 128).foldLeft(lit(false))((e, i) => 
> e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int")))
> val df2 = df.filter(filter).sort(df.columns(0))
> // Regular path
> timer {
>   df2.filter(df2.col(df2.columns(0)) === lit(3)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(4)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(5)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(6)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(7)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(8)).collect
> }
> #0: 24.178487906
> #1: 22.619839888
> #2: 22.318617035
> #3: 22.131305502
> #4: 22.532095611
> #5: 22.245152778
> #6: 22.314114847
> #7: 22.284385952
> #8: 22.053593855
> Avg. Elapsed Time: 22.51973259712s
> // Prepared path
> val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared
> timer {
>   df3b.bindParam("val", 3).collect
>   df3b.bindParam("val", 4).collect
>   df3b.bindParam("val", 5).collect
>   df3b.bindParam("val", 6).collect
>   df3b.bindParam("val", 7).collect
>   df3b.bindParam("val", 8).collect
> }
> #0: 0.744693912
> #1: 0.743187129
> #2: 0.74513
> #3: 0.721668718
> #4: 0.757573342
> #5: 0.763240883
> #6: 0.731287275
> #7: 0.728740601
> #8: 0.674275592
> Avg. Elapsed Time: 0.734418606112s
> {code}
> I'm not sure whether this approach is acceptable, so I welcome any suggestions 
> and advice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20498:


Assignee: Apache Spark  (was: Xin Ren)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Apache Spark
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20498:


Assignee: Xin Ren  (was: Apache Spark)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20891) Reduce duplicate code in typedaggregators.scala

2017-05-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026021#comment-16026021
 ] 

Sean Owen commented on SPARK-20891:
---

Can you refactor the code first, then make your change?
I understand your point but unless it's a huge change, you probably want to 
make the change plus necessary other updates together.
Yes, someone might make a conflicting change in the meantime and you'd have to 
be vigilant to keep your PR up to date until it's merged.

> Reduce duplicate code in typedaggregators.scala
> ---
>
> Key: SPARK-20891
> URL: https://issues.apache.org/jira/browse/SPARK-20891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ruben Janssen
>
> With SPARK-20411, a significant number of functions will be added to 
> typedaggregators.scala, resulting in a large amount of duplicate code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026022#comment-16026022
 ] 

Apache Spark commented on SPARK-20498:
--

User 'facaiy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18120

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19581.
---
Resolution: Cannot Reproduce

> running NaiveBayes model with 0 features can crash the executor with D 
> rorreGEMV
> 
>
> Key: SPARK-19581
> URL: https://issues.apache.org/jira/browse/SPARK-19581
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark development or standalone mode on windows or linux.
>Reporter: Barry Becker
>Priority: Minor
>
> The severity of this bug is high (because nothing should cause spark to crash 
> like this) but the priority may be low (because there is an easy workaround).
> In our application, a user can select features and a target to run the 
> NaiveBayes inducer. If columns have too many values or all one value, they 
> will be removed before we call the inducer to create the model. As a result, 
> there are some cases, where all the features may get removed. When this 
> happens, executors will crash and get restarted (if on a cluster) or spark 
> will crash and need to be manually restarted (if in development mode).
> It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well 
> when it is encountered. It emits this vague error:
> ** On entry to DGEMV  parameter number  6 had an illegal value
> and terminates.
> My code looks like this:
> {code}
>val predictions = model.transform(testData)  // Make predictions
> // figure out how many were correctly predicted
> val numCorrect = predictions.filter(new Column(actualTarget) === new 
> Column(PREDICTION_LABEL_COLUMN)).count()
> val numIncorrect = testRowCount - numCorrect
> {code}
> The failure is at the line that does the count, but it is not the count that 
> causes the problem, it is the model.transform step (where the model contains 
> the NaiveBayes classifier).
> Here is the stack trace (in development mode):
> {code}
> [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] 
> [akka://JobServer/user/context-supervisor/sql-context] -  done making 
> predictions in 232
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505)
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29)
> [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException:
>  Job 12 cancelled because SparkContext was shut down))
> [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] 
> [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable
> org.apache.spark.SparkException: Job 12 cancelled because SparkContext was 
> shut down
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668)
> at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587)
> at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1825)
> at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  
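
To make the easy workaround mentioned in the report concrete, here is a minimal sketch. It assumes the selected feature columns have already been pruned and are assembled with VectorAssembler; the method and column names are illustrative, not from the reporter's code. The idea is simply to skip fitting when no feature columns remain.

{code}
// Sketch only: guard against the zero-features case before calling NaiveBayes,
// which avoids handing BLAS/DGEMV an empty feature vector.
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

def fitIfFeaturesRemain(train: DataFrame,
                        featureCols: Array[String],
                        labelCol: String): Option[NaiveBayesModel] = {
  if (featureCols.isEmpty) {
    None  // nothing left to train on after column pruning
  } else {
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(train)
    val nb = new NaiveBayes()
      .setFeaturesCol("features")
      .setLabelCol(labelCol)
    Some(nb.fit(assembled))
  }
}
{code}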

[jira] [Resolved] (SPARK-20767) The training continuation for saved LDA model

2017-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20767.
---
Resolution: Duplicate

Let's roll this proposal into the existing issue

> The training continuation for saved LDA model
> -
>
> Key: SPARK-20767
> URL: https://issues.apache.org/jira/browse/SPARK-20767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cezary Dendek
>Priority: Minor
>
> The current online implementation of LDA model fitting (OnlineLDAOptimizer) does 
> not support model updates (i.e., to account for population/covariate drift), nor 
> the continuation of model fitting in case of an insufficient 
> number of iterations.
> Technical aspects:
> 1. The implementation of LDA fitting does not currently allow pre-setting the 
> coefficients (private setter), as noted by a comment in the 
> source code of OnlineLDAOptimizer.setLambda: "This is only used for testing 
> now. In the future, it can help support training stop/resume".
> 2. The lambda matrix is always randomly initialized by the optimizer, which 
> would need to change to support a preset lambda matrix.
> Adaptation of these classes by the user is not possible due to protected 
> setters and sealed/final classes.
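
As a rough illustration of what training continuation could look like if such a setter were exposed: the sketch below is hypothetical, not an existing Spark API; the saved-model path and the corpus RDD are placeholders, and the commented line is exactly what this ticket asks for.

{code}
// Hypothetical sketch only: OnlineLDAOptimizer does NOT currently expose its
// lambda (topic) matrix publicly, so the commented line below does not compile today.
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val saved = LocalLDAModel.load(sc, "/path/to/saved/lda-model")  // placeholder path
val corpus: RDD[(Long, Vector)] = ???                           // placeholder corpus

val optimizer = new OnlineLDAOptimizer()
// optimizer.setLambda(...)  // hypothetical public setter: preload the saved lambda matrix

val continued = new LDA()
  .setK(saved.k)
  .setOptimizer(optimizer)
  .run(corpus)  // would resume from the saved state instead of a random lambda
{code}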



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders

2017-05-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025976#comment-16025976
 ] 

Takeshi Yamamuro commented on SPARK-20895:
--

At least, I feel we could get performance improvements from this when the 
optimizer takes a long time, as described in the description.

> Support fast execution based on an optimized plan and parameter placeholders
> 
>
> Key: SPARK-20895
> URL: https://issues.apache.org/jira/browse/SPARK-20895
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In database scenarios, users sometimes use parameterized queries for repeated 
> execution (e.g., by using prepared statements).
> So, I think this functionality is also useful for Spark users.
> What I suggest here looks like the following:
> My prototype here: 
> https://github.com/apache/spark/compare/master...maropu:PreparedStmt2
> {code}
> scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t")
> // Define a query with a parameter placeholder named `val`
> scala> val df = sql("SELECT * FROM t WHERE col1 = $val")
> scala> df.explain
> == Physical Plan ==
> *Project [_1#13 AS col1#16, _2#14 AS col2#17]
> +- *Filter (_1#13 = cast(parameterholder(val) as int))
>+- LocalTableScan [_1#13, _2#14]
> // Apply optimizer rules and get an optimized logical plan with the parameter 
> placeholder
> scala> val preparedDf = df.prepared
> // Bind an actual value and do execution
> scala> preparedDf.bindParam("val", 1).show()
> +++
> |col1|col2|
> +++
> |   1|   2|
> +++
> {code}
> To implement this, my prototype adds a new expression leaf node named 
> `ParameterHolder`.
> In a binding phase, this node is replaced with `Literal` including an actual 
> value by using `bindParam`.
> Currently, Spark sometimes consumes much time to rewrite logical plans in 
> `Optimizer` (e.g. constant propagation described in SPARK-19846).
> So, I feel this approach is also helpful in that case:
> {code}
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
>     val t0 = System.nanoTime()
>     f
>     val t1 = System.nanoTime()
>     val elapsed = t1 - t0 + 0.0
>     println(s"#$i: ${elapsed / 1e9}")
>     elapsed
>   }
>   println("Avg. Elapsed Time: " + ((iters.sum / count) / 1e9) + "s")
> }
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val numCols = 50
> val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS 
> _c$i"): _*)
> // Add conditions to take much time in Optimizer
> val filter = (0 until 128).foldLeft(lit(false))((e, i) => 
> e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int")))
> val df2 = df.filter(filter).sort(df.columns(0))
> // Regular path
> timer {
>   df2.filter(df2.col(df2.columns(0)) === lit(3)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(4)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(5)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(6)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(7)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(8)).collect
> }
> #0: 24.178487906
> #1: 22.619839888
> #2: 22.318617035
> #3: 22.131305502
> #4: 22.532095611
> #5: 22.245152778
> #6: 22.314114847
> #7: 22.284385952
> #8: 22.053593855
> Avg. Elapsed Time: 22.51973259712s
> // Prepared path
> val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared
> timer {
>   df3b.bindParam("val", 3).collect
>   df3b.bindParam("val", 4).collect
>   df3b.bindParam("val", 5).collect
>   df3b.bindParam("val", 6).collect
>   df3b.bindParam("val", 7).collect
>   df3b.bindParam("val", 8).collect
> }
> #0: 0.744693912
> #1: 0.743187129
> #2: 0.74513
> #3: 0.721668718
> #4: 0.757573342
> #5: 0.763240883
> #6: 0.731287275
> #7: 0.728740601
> #8: 0.674275592
> Avg. Elapsed Time: 0.734418606112s
> {code}
> I'm not sure this approach is acceptable, so welcome any suggestion and 
> advice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true

2017-05-26 Thread Damian Momot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025974#comment-16025974
 ] 

Damian Momot commented on SPARK-19293:
--

Some additional errors for speculative tasks

{code}
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2102)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1214)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1210)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1210)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:399)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:355)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159)
... 28 more
{code}

{code}
java.lang.IllegalArgumentException: requirement failed: cannot write to a 
closed ByteBufferOutputStream
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:40)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at 
java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1580)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:351)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:142)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
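
For reference, the configuration under discussion can be enabled as in the minimal sketch below; the multiplier shown is just the documented default, not a value taken from this report.

{code}
// Sketch: enabling speculative execution for a SparkSession-based job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-example")
  .config("spark.speculation", "true")             // launch speculative copies of slow tasks
  .config("spark.speculation.multiplier", "1.5")   // default: task must be 1.5x slower than the median
  .getOrCreate()
{code}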

> Spark 2.1.x unstable with spark.speculation=true
> 
>
> Key: SPARK-19293
> URL: https://issues.apache.org/jira/browse/SPARK-19293
> Project: Spark
>  

[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders

2017-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025970#comment-16025970
 ] 

Reynold Xin commented on SPARK-20895:
-

What's the benefit?

> Support fast execution based on an optimized plan and parameter placeholders
> 
>
> Key: SPARK-20895
> URL: https://issues.apache.org/jira/browse/SPARK-20895
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In database scenarios, users sometimes use parameterized queries for repeated 
> execution (e.g., by using prepared statements).
> So, I think this functionality is also useful for Spark users.
> What I suggest here looks like the following:
> My prototype here: 
> https://github.com/apache/spark/compare/master...maropu:PreparedStmt2
> {code}
> scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t")
> // Define a query with a parameter placeholder named `val`
> scala> val df = sql("SELECT * FROM t WHERE col1 = $val")
> scala> df.explain
> == Physical Plan ==
> *Project [_1#13 AS col1#16, _2#14 AS col2#17]
> +- *Filter (_1#13 = cast(parameterholder(val) as int))
>+- LocalTableScan [_1#13, _2#14]
> // Apply optimizer rules and get an optimized logical plan with the parameter 
> placeholder
> scala> val preparedDf = df.prepared
> // Bind an actual value and do execution
> scala> preparedDf.bindParam("val", 1).show()
> +++
> |col1|col2|
> +++
> |   1|   2|
> +++
> {code}
> To implement this, my prototype adds a new expression leaf node named 
> `ParameterHolder`.
> In a binding phase, this node is replaced with `Literal` including an actual 
> value by using `bindParam`.
> Currently, Spark sometimes consumes much time to rewrite logical plans in 
> `Optimizer` (e.g. constant propagation described in SPARK-19846).
> So, I feel this approach is also helpful in that case:
> {code}
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
>     val t0 = System.nanoTime()
>     f
>     val t1 = System.nanoTime()
>     val elapsed = t1 - t0 + 0.0
>     println(s"#$i: ${elapsed / 1e9}")
>     elapsed
>   }
>   println("Avg. Elapsed Time: " + ((iters.sum / count) / 1e9) + "s")
> }
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val numCols = 50
> val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS 
> _c$i"): _*)
> // Add conditions to take much time in Optimizer
> val filter = (0 until 128).foldLeft(lit(false))((e, i) => 
> e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int")))
> val df2 = df.filter(filter).sort(df.columns(0))
> // Regular path
> timer {
>   df2.filter(df2.col(df2.columns(0)) === lit(3)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(4)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(5)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(6)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(7)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(8)).collect
> }
> #0: 24.178487906
> #1: 22.619839888
> #2: 22.318617035
> #3: 22.131305502
> #4: 22.532095611
> #5: 22.245152778
> #6: 22.314114847
> #7: 22.284385952
> #8: 22.053593855
> Avg. Elapsed Time: 22.51973259712s
> // Prepared path
> val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared
> timer {
>   df3b.bindParam("val", 3).collect
>   df3b.bindParam("val", 4).collect
>   df3b.bindParam("val", 5).collect
>   df3b.bindParam("val", 6).collect
>   df3b.bindParam("val", 7).collect
>   df3b.bindParam("val", 8).collect
> }
> #0: 0.744693912
> #1: 0.743187129
> #2: 0.74513
> #3: 0.721668718
> #4: 0.757573342
> #5: 0.763240883
> #6: 0.731287275
> #7: 0.728740601
> #8: 0.674275592
> Avg. Elapsed Time: 0.734418606112s
> {code}
> I'm not sure this approach is acceptable, so welcome any suggestion and 
> advice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders

2017-05-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025966#comment-16025966
 ] 

Takeshi Yamamuro commented on SPARK-20895:
--

cc: [~rxin][~cloud_fan][~smilegator]

> Support fast execution based on an optimized plan and parameter placeholders
> 
>
> Key: SPARK-20895
> URL: https://issues.apache.org/jira/browse/SPARK-20895
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In database scenarios, users sometimes use parameterized queries for repeated 
> execution (e.g., by using prepared statements).
> So, I think this functionality is also useful for Spark users.
> What I suggest here looks like the following:
> My prototype here: 
> https://github.com/apache/spark/compare/master...maropu:PreparedStmt2
> {code}
> scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t")
> // Define a query with a parameter placeholder named `val`
> scala> val df = sql("SELECT * FROM t WHERE col1 = $val")
> scala> df.explain
> == Physical Plan ==
> *Project [_1#13 AS col1#16, _2#14 AS col2#17]
> +- *Filter (_1#13 = cast(parameterholder(val) as int))
>+- LocalTableScan [_1#13, _2#14]
> // Apply optimizer rules and get an optimized logical plan with the parameter 
> placeholder
> scala> val preparedDf = df.prepared
> // Bind an actual value and do execution
> scala> preparedDf.bindParam("val", 1).show()
> +++
> |col1|col2|
> +++
> |   1|   2|
> +++
> {code}
> To implement this, my prototype adds a new expression leaf node named 
> `ParameterHolder`.
> In a binding phase, this node is replaced with `Literal` including an actual 
> value by using `bindParam`.
> Currently, Spark sometimes consumes much time to rewrite logical plans in 
> `Optimizer` (e.g. constant propagation described in SPARK-19846).
> So, I feel this approach is also helpful in that case:
> {code}
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
>     val t0 = System.nanoTime()
>     f
>     val t1 = System.nanoTime()
>     val elapsed = t1 - t0 + 0.0
>     println(s"#$i: ${elapsed / 1e9}")
>     elapsed
>   }
>   println("Avg. Elapsed Time: " + ((iters.sum / count) / 1e9) + "s")
> }
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val numCols = 50
> val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS 
> _c$i"): _*)
> // Add conditions to take much time in Optimizer
> val filter = (0 until 128).foldLeft(lit(false))((e, i) => 
> e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int")))
> val df2 = df.filter(filter).sort(df.columns(0))
> // Regular path
> timer {
>   df2.filter(df2.col(df2.columns(0)) === lit(3)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(4)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(5)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(6)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(7)).collect
>   df2.filter(df2.col(df2.columns(0)) === lit(8)).collect
> }
> #0: 24.178487906
> #1: 22.619839888
> #2: 22.318617035
> #3: 22.131305502
> #4: 22.532095611
> #5: 22.245152778
> #6: 22.314114847
> #7: 22.284385952
> #8: 22.053593855
> Avg. Elapsed Time: 22.51973259712s
> // Prepared path
> val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared
> timer {
>   df3b.bindParam("val", 3).collect
>   df3b.bindParam("val", 4).collect
>   df3b.bindParam("val", 5).collect
>   df3b.bindParam("val", 6).collect
>   df3b.bindParam("val", 7).collect
>   df3b.bindParam("val", 8).collect
> }
> #0: 0.744693912
> #1: 0.743187129
> #2: 0.74513
> #3: 0.721668718
> #4: 0.757573342
> #5: 0.763240883
> #6: 0.731287275
> #7: 0.728740601
> #8: 0.674275592
> Avg. Elapsed Time: 0.734418606112s
> {code}
> I'm not sure this approach is acceptable, so welcome any suggestion and 
> advice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders

2017-05-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20895:


 Summary: Support fast execution based on an optimized plan and 
parameter placeholders
 Key: SPARK-20895
 URL: https://issues.apache.org/jira/browse/SPARK-20895
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro
Priority: Minor


In database scenarios, users sometimes use parameterized queries for repeated 
execution (e.g., by using prepared statements).
So, I think this functionality is also useful for Spark users.
What I suggest here looks like the following:
My prototype here: 
https://github.com/apache/spark/compare/master...maropu:PreparedStmt2

{code}
scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t")

// Define a query with a parameter placeholder named `val`
scala> val df = sql("SELECT * FROM t WHERE col1 = $val")
scala> df.explain
== Physical Plan ==
*Project [_1#13 AS col1#16, _2#14 AS col2#17]
+- *Filter (_1#13 = cast(parameterholder(val) as int))
   +- LocalTableScan [_1#13, _2#14]

// Apply optimizer rules and get an optimized logical plan with the parameter 
placeholder
scala> val preparedDf = df.prepared

// Bind an actual value and do execution
scala> preparedDf.bindParam("val", 1).show()
+++
|col1|col2|
+++
|   1|   2|
+++
{code}

To implement this, my prototype adds a new expression leaf node named 
`ParameterHolder`.
In a binding phase, this node is replaced with `Literal` including an actual 
value by using `bindParam`.
Currently, Spark sometimes consumes much time to rewrite logical plans in 
`Optimizer` (e.g. constant propagation described in SPARK-19846).
So, I feel this approach is also helpful in that case:

{code}
def timer[R](f: => {}): Unit = {
  val count = 9
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    println(s"#$i: ${elapsed / 1e9}")
    elapsed
  }
  println("Avg. Elapsed Time: " + ((iters.sum / count) / 1e9) + "s")
}

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val numCols = 50
val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS _c$i"): 
_*)
// Add conditions to take much time in Optimizer
val filter = (0 until 128).foldLeft(lit(false))((e, i) => 
e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int")))
val df2 = df.filter(filter).sort(df.columns(0))

// Regular path
timer {
  df2.filter(df2.col(df2.columns(0)) === lit(3)).collect
  df2.filter(df2.col(df2.columns(0)) === lit(4)).collect
  df2.filter(df2.col(df2.columns(0)) === lit(5)).collect
  df2.filter(df2.col(df2.columns(0)) === lit(6)).collect
  df2.filter(df2.col(df2.columns(0)) === lit(7)).collect
  df2.filter(df2.col(df2.columns(0)) === lit(8)).collect
}

#0: 24.178487906
#1: 22.619839888
#2: 22.318617035
#3: 22.131305502
#4: 22.532095611
#5: 22.245152778
#6: 22.314114847
#7: 22.284385952
#8: 22.053593855
Avg. Elapsed Time: 22.51973259712s

// Prepared path
val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared

timer {
  df3b.bindParam("val", 3).collect
  df3b.bindParam("val", 4).collect
  df3b.bindParam("val", 5).collect
  df3b.bindParam("val", 6).collect
  df3b.bindParam("val", 7).collect
  df3b.bindParam("val", 8).collect
}

#0: 0.744693912
#1: 0.743187129
#2: 0.74513
#3: 0.721668718
#4: 0.757573342
#5: 0.763240883
#6: 0.731287275
#7: 0.728740601
#8: 0.674275592
Avg. Elapsed Time: 0.734418606112s
{code}

I'm not sure this approach is acceptable, so welcome any suggestion and advice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20775) from_json should also have an API where the schema is specified with a string

2017-05-26 Thread Ruben Janssen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025962#comment-16025962
 ] 

Ruben Janssen commented on SPARK-20775:
---

Hi, it is RubenJanssen  :)

> from_json should also have an API where the schema is specified with a string
> -
>
> Key: SPARK-20775
> URL: https://issues.apache.org/jira/browse/SPARK-20775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
> Fix For: 2.3.0
>
>
> Right now you also have to provide a java.util.Map, which is not nice for 
> Scala users.
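
For context, a minimal sketch of the Scala-friendly overload that already exists (schema given as a StructType); the ticket asks for an equally convenient overload where the schema is passed as a plain string. The DataFrame `df` with a string column "json" is an assumption for illustration.

{code}
// Sketch: the existing StructType-based from_json call.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)

val parsed = df.select(from_json(col("json"), schema).as("parsed"))  // df assumed
{code}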



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2017-05-26 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-20498:


Sure, please go ahead.



> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 
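
For comparison, the Scala API already exposes this getter on the fitted model; a minimal sketch, assuming a `trainingData` DataFrame with "features" and "label" columns:

{code}
// Sketch (Scala side): the fitted model exposes getMaxDepth directly.
import org.apache.spark.ml.regression.RandomForestRegressor

val model = new RandomForestRegressor()
  .setMaxDepth(7)
  .fit(trainingData)  // trainingData assumed

model.getMaxDepth  // returns 7; the ticket asks for the same convenience in PySpark
{code}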



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2017-05-26 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025949#comment-16025949
 ] 

Yan Facai (颜发才) commented on SPARK-20498:
-

[~iamshrek] Hi, Xin Ren.
As the task is quite easy, if you are a little busy I'd be glad to work on it.
Is that OK?




> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-26 Thread pralabhkumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025932#comment-16025932
 ] 

pralabhkumar commented on SPARK-20199:
--

I have created a pull request. The main changes:

1) Moved featureSubsetStrategy to TreeEnsembleParams instead of having it on 
RandomForestParams, so that it can be used for both Random Forest and GBT.
2) Changed the private DecisionTreeRegressor train method to pass 
featureSubsetStrategy.
3) To test, changed GradientBoostedTreeClassifierExample with:
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)
  .setFeatureSubsetStrategy("auto")



> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025927#comment-16025927
 ] 

Apache Spark commented on SPARK-19372:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18119

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0
>
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB".
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-26 Thread pralabhkumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar updated SPARK-20199:
-
Summary: GradientBoostedTreesModel doesn't have  featureSubsetStrategy 
parameter  (was: GradientBoostedTreesModel doesn't have  Column Sampling Rate 
Paramenter)

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025923#comment-16025923
 ] 

Apache Spark commented on SPARK-20199:
--

User 'pralabhkumar' has created a pull request for this issue:
https://github.com/apache/spark/pull/18118

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20199:


Assignee: Apache Spark

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>Assignee: Apache Spark
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20199:


Assignee: (was: Apache Spark)

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have Column  sampling rate parameter 
> . This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


