[jira] [Created] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization

2015-12-04 Thread Ferdinand Xu (JIRA)
Ferdinand Xu created SPARK-12145:


 Summary: Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL 
based authorization
 Key: SPARK-12145
 URL: https://issues.apache.org/jira/browse/SPARK-12145
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Ferdinand Xu


"Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET 
key=value" command. Also, the user information required to initialize the 
authorization is missing when sessions are created.
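
For reference, a minimal reproduction sketch, assuming a Hive-enabled deployment with SQL standard based authorization turned on (context setup below is illustrative only):

{code}
import org.apache.spark.sql.hive.HiveContext

// Sketch: each of these should be handled as a role-switch statement, but per the
// description above it is currently parsed as a plain "SET key=value" command.
val hiveContext = new HiveContext(sc)   // `sc` is an existing SparkContext
hiveContext.sql("SET ROLE ADMIN")
hiveContext.sql("SET ROLE NONE")
hiveContext.sql("SET ROLE ALL")
{code}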






[jira] [Commented] (SPARK-9838) Support Poisson family in SparkR:::glm

2015-12-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041258#comment-15041258
 ] 

Xusen Yin commented on SPARK-9838:
--

Hi [~mengxr], can I work on this?

> Support Poisson family in SparkR:::glm
> --
>
> Key: SPARK-9838
> URL: https://issues.apache.org/jira/browse/SPARK-9838
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Support Poisson family in SparkR::glm(). This task might need further 
> refinements.






[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration

2015-12-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041260#comment-15041260
 ] 

Saisai Shao commented on SPARK-10123:
-

Hi [~vanzin], would you mind letting me take a crack at this?

> Cannot set "--deploy-mode" in default configuration
> ---
>
> Key: SPARK-10123
> URL: https://issues.apache.org/jira/browse/SPARK-10123
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's no configuration option that is the equivalent of "--deploy-mode". So 
> it's not possible, for example, to have applications be submitted in 
> standalone cluster mode by default - you have to always use the command line 
> argument for that.
> YARN is special because it has the (somewhat deprecated) "yarn-cluster" 
> master, but it would be nice to be consistent and have a proper config option 
> for this.






[jira] [Assigned] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12145:


Assignee: (was: Apache Spark)

> Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
> ---
>
> Key: SPARK-12145
> URL: https://issues.apache.org/jira/browse/SPARK-12145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ferdinand Xu
>
> "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET 
> key=value" command. Also the user information is missing when creating 
> sessions which is required to initialize the authorization.






[jira] [Commented] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041254#comment-15041254
 ] 

Apache Spark commented on SPARK-12145:
--

User 'winningsix' has created a pull request for this issue:
https://github.com/apache/spark/pull/10144

> Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
> ---
>
> Key: SPARK-12145
> URL: https://issues.apache.org/jira/browse/SPARK-12145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ferdinand Xu
>
> "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET 
> key=value" command. Also the user information is missing when creating 
> sessions which is required to initialize the authorization.






[jira] [Assigned] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12145:


Assignee: Apache Spark

> Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
> ---
>
> Key: SPARK-12145
> URL: https://issues.apache.org/jira/browse/SPARK-12145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ferdinand Xu
>Assignee: Apache Spark
>
> "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET 
> key=value" command. Also the user information is missing when creating 
> sessions which is required to initialize the authorization.






[jira] [Created] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12146:
---

 Summary: SparkR jsonFile should support multiple input files
 Key: SPARK-12146
 URL: https://issues.apache.org/jira/browse/SPARK-12146
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Yanbo Liang


SparkR jsonFile should support multiple input files.






[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041301#comment-15041301
 ] 

Saisai Shao commented on SPARK-12103:
-

I had a proposal for a message handler (receiver interceptor) for the 
receiver-based Kafka stream to support your scenario, but it sat unused for a 
long time and was only adopted in the direct Kafka stream. Adding this feature 
back to the receiver-based Kafka stream now would break a lot of things, and it 
is not very worthwhile (I think lots of users are shifting to the direct API now).
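
For readers hitting this with the receiver-based API, a rough sketch of how the direct API keeps the topic via its message-handler overload; `ssc`, `kafkaParams` and `fromOffsets` are assumed to exist in the caller's code and are placeholders:

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: the direct stream's messageHandler receives the full MessageAndMetadata,
// so the topic name can be carried alongside the key and value.
// Assumed: ssc: StreamingContext, kafkaParams: Map[String, String],
//          fromOffsets: Map[TopicAndPartition, Long]
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String, String)](
  ssc,
  kafkaParams,
  fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.key, mmd.message))

// Each record is (topic, key, value), so downstream code can branch on the topic.
{code}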

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The key in each tuple is always null, making it 
> impossible to know which topic the data came from, other than stashing your 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned: DStream[(String, String)] = ssc.union(streams)
> unioned.flatMap(x => {
>   val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>     ..
>   }
> })
>  END CODE






[jira] [Updated] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12146:

Description: 
This bug is easy to reproduce, jsonFile did not support character vector as 
arguments.
{code}
> path <- c("/path/to/dir1","/path/to/dir2")
> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2
{code}

  was:
> path <- c("/path/to/dir1","/path/to/dir2")
> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2


> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce, jsonFile did not support character vector as 
> arguments.
> {code}
> > path <- c("/path/to/dir1","/path/to/dir2")
> > raw.terror<-jsonFile(sqlContext,path)
> 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.io.IOException: No input paths specified in job
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2
> {code}
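
Until jsonFile accepts multiple paths, a minimal Scala-side workaround sketch, assuming the JSON directories share a compatible schema (paths are the same placeholders as above, and `sqlContext` is an existing SQLContext):

{code}
// Sketch only: read each directory separately and union the results, which is
// what a multi-path jsonFile would do in a single call.
val paths = Seq("/path/to/dir1", "/path/to/dir2")
val combined = paths
  .map(p => sqlContext.read.json(p))
  .reduce(_ unionAll _)   // DataFrame.unionAll in Spark 1.x
{code}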






[jira] [Commented] (SPARK-12137) Spark Streaming State Recovery limitations

2015-12-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041500#comment-15041500
 ] 

Sean Owen commented on SPARK-12137:
---

The checkpoint mechanism works for recovering after an app fails and is 
restarted. There's no way to guarantee it works for an arbitrarily different 
new version of your app. It may work if you've designed the state you serialize 
appropriately. This is no different from the issue of deserializing objects 
serialized from an old version of POJOs, for instance.

What's the issue here -- are you just asking the question? I believe this is 
the answer.

> Spark Streaming State Recovery limitations
> --
>
> Key: SPARK-12137
> URL: https://issues.apache.org/jira/browse/SPARK-12137
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Ravindar
>Priority: Critical
>
> There were multiple threads in the forums asking a similar question without a 
> clear answer, hence entering it here.
> We have a streaming application that goes through multi-step processing. In 
> some of these steps, stateful operations like *updateStateByKey* are used to 
> maintain an accumulated running state (and other state info) across incoming 
> RDD streams. As the streaming application is incremental, it is imperative that 
> we recover/restore from the previous known state in the following two scenarios:
>   1. On Spark driver/streaming application failure.
>  In this scenario the driver/streaming application is shut down and 
> restarted. The recommended approach is to enable *checkpoint(checkpointDir)* 
> and use *StreamingContext.getOrCreate* to restore the context from the 
> checkpoint state.
>   2. Upgrade of the driver/streaming application with additional steps in the 
> processing.
>  In this scenario, we introduced new steps with downstream processing for 
> new functionality without changes to the existing steps. Upgrading the streaming 
> application with the new code fails on *StreamingContext.getOrCreate*, as there 
> is a mismatch with the saved checkpoint.
> Both of the above scenarios need a unified approach where accumulated state 
> can be saved and restored. The first approach of restoring from a checkpoint 
> works for driver failure but not for a code upgrade. When the application code 
> changes, the recommendation is to delete the checkpoint data when the new code 
> is deployed. If so, how do you reconstitute all of the stateful (e.g. 
> updateStateByKey) information from the last run? Every streaming application 
> has to save up-to-date state for each session, keyed by session key, and then 
> initialize from it when a new session starts for the same key. Does every 
> application have to create its own mechanism for this, given that it is very 
> similar to the current state checkpointing to HDFS? 
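
One pattern discussed in those threads, sketched here under stated assumptions (paths, host/port and the word-count state are placeholders, and loading a missing snapshot is not handled): snapshot the state RDD to durable storage each batch, and feed it back in as the initial state when the upgraded application starts with a fresh checkpoint directory.

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: keep the accumulated state outside the checkpoint, so a new version of
// the app can start with a new checkpoint directory and still resume its state.
// `sc` is an existing SparkContext; paths, host and port are placeholders.
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///app/checkpoints")     // still required by updateStateByKey

val keyedStream = ssc.socketTextStream("localhost", 9999).map(word => (word, 1L))

// State snapshot written by a previous run (handling a missing path is left out).
val initialState = sc.objectFile[(String, Long)]("hdfs:///app/state/latest")

val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

val counts = keyedStream.updateStateByKey(
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialState)

// Periodically snapshot the state so the next (possibly upgraded) version can load it.
counts.foreachRDD { rdd =>
  rdd.saveAsObjectFile(s"hdfs:///app/state/snapshot-${System.currentTimeMillis}")
}
{code}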






[jira] [Commented] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark

2015-12-04 Thread Yuval Tanny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041497#comment-15041497
 ] 

Yuval Tanny commented on SPARK-11868:
-

No. I think there should be a decision on the desired behaviour beforehand. I 
provided here the details I found while trying to understand how to reproduce 
the problem I've encountered.

> wrong results returned from dataframe created from Rows without consistent 
> schema on pyspark
> --
>
> Key: SPARK-11868
> URL: https://issues.apache.org/jira/browse/SPARK-11868
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: pyspark
>Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it's 
> possible to create a dataframe from dictionaries, and if a key is missing, its 
> value is None. But when trying to create a dataframe from the corresponding Rows, 
> we get inconsistent behavior (wrong values for keys) without an exception. See 
> the example below.
> The problems seem to be:
> 1. Not verifying all rows against the inferred schema.
> 2. In pyspark.sql.types._create_converter, None is set when converting a 
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is simply assumed that the number of fields in the tuple equals 
> the number of fields in the inferred schema, so values end up under the wrong 
> keys otherwise:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks. 
> example:
> {code}
> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}






[jira] [Updated] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12146:

Description: 
> path <- c("/path/to/dir1","/path/to/dir2")
> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2

  was:SparkR jsonFile should support multiple input files.


> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> > path <- c("/path/to/dir1","/path/to/dir2")
> > raw.terror<-jsonFile(sqlContext,path)
> 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.io.IOException: No input paths specified in job
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2






[jira] [Resolved] (SPARK-12122) Recovered streaming context can sometimes run a batch twice

2015-12-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-12122.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Recovered streaming context can sometimes run a batch twice
> ---
>
> Key: SPARK-12122
> URL: https://issues.apache.org/jira/browse/SPARK-12122
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.6.0
>
>
> After recovering from checkpoint, the JobGenerator figures out which batches 
> to run again. That can sometimes lead to a batch being submitted twice. 






[jira] [Assigned] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12146:


Assignee: (was: Apache Spark)

> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR jsonFile should support multiple input files.






[jira] [Commented] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041320#comment-15041320
 ] 

Apache Spark commented on SPARK-12146:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10145

> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR jsonFile should support multiple input files.






[jira] [Assigned] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12146:


Assignee: Apache Spark

> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> SparkR jsonFile should support multiple input files.






[jira] [Created] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-12147:


 Summary: Off heap storage and dynamicAllocation operation
 Key: SPARK-12147
 URL: https://issues.apache.org/jira/browse/SPARK-12147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
 Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
Tachyon 0.7.1
Yarn
Reporter: Rares Mirica


For the purpose of increasing computation density and efficiency I set out to 
test off-heap storage (using Tachyon) with dynamicAllocation enabled.

Following the available documentation (the programming guide for Spark 1.5.2) I was 
expecting data to be cached in Tachyon for the lifetime of the application 
(driver instance) or until unpersist() is called. This belief was supported by 
the doc: "Cached data is not lost if individual executors crash.", where I take 
"crash" to also cover graceful decommission (GD). Furthermore, in the GD 
description in the job-scheduling document, cached data preservation 
through off-heap storage is also hinted at.

Seeing how Tachyon is now in a state where these promises of a better future 
are well within reach, I consider it a bug that upon graceful decommission of 
an executor the off-heap data is deleted (presumably as part of the cleanup 
phase).

Needless to say, enabling the preservation of the off-heap persisted data after 
graceful decommission for dynamic allocation would yield significant 
improvements in resource allocation, especially over yarn where executors use 
up compute "slots" even if idle. After a long, expensive, computation where we 
take advantage of the dynamically scaled executors, the rest of the spark jobs 
can use the cached data while releasing the compute resources for other cluster 
tasks.






[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2015-12-04 Thread Jorge Sanchez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041626#comment-15041626
 ] 

Jorge Sanchez commented on SPARK-2489:
--

This functionality would be very helpful.
Is there any way we can help to fix it?

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Pei-Lun Lee
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}






[jira] [Commented] (SPARK-11707) StreamCorruptedException if authentication is enabled

2015-12-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041623#comment-15041623
 ] 

Sean Owen commented on SPARK-11707:
---

[~jlewandowski] any chance you can try this with the latest code?
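
For anyone re-running this, a hedged sketch of the kind of configuration the report involves (the standalone master URL and secret are placeholders; any encryption settings are omitted):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable RPC authentication in standalone mode, then run the snippet
// from the description to see whether the StreamCorruptedException still appears.
val conf = new SparkConf()
  .setAppName("auth-repro")
  .setMaster("spark://master:7077")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "not-a-real-secret")

val sc = new SparkContext(conf)
sc.makeRDD(1 to 10, 10).map(x => x * x).map(_.toString).reduce(_ + _)
{code}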

> StreamCorruptedException if authentication is enabled
> -
>
> Key: SPARK-11707
> URL: https://issues.apache.org/jira/browse/SPARK-11707
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> When authentication (and encryption) is enabled (at least in standalone 
> mode), the following code (in Spark shell):
> {code}
> sc.makeRDD(1 to 10, 10).map(x => x*x).map(_.toString).reduce(_ + _)
> {code}
> finishes with exception:
> {noformat}
> [Stage 0:> (0 + 8) / 
> 10]15/11/12 20:36:29 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 5750598674048943239
> java.io.StreamCorruptedException: invalid type code: 30
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2508)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2543)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2702)
>   at java.io.ObjectInputStream.read(ObjectInputStream.java:865)
>   at 
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:38)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:32)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1186)
>   at 
> org.apache.spark.util.SerializableBuffer.readObject(SerializableBuffer.scala:32)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:248)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:296)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:247)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:448)
>   at 
> org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:76)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:122)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:94)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> 

[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12147:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Make the title more specific?)

I disagree that it's a 'bug'. Cached data is not 'lost' when an executor is lost, in the 
sense that it was only ever a cached copy. The cached copy is gone, but it can be 
recreated.

Tachyon is not required, and doesn't look like it's going to be, so I'm not 
sure in general it's a solution to something. Your suggestion requires doubling 
the amount of storage for cached data: now things live in memory or on local 
disk and also in Tachyon or something. Right?

It's also not true that, after executors are decommissioned, the others can keep 
using the cached data. There aren't enough executors to keep the cached data 
live any more.

I think this has problems but maybe you can clarify what you mean.

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set up to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash." where with crash I also assimilate Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially over yarn where executors use 
> up compute "slots" even if idle. After a long, expensive, computation where 
> we take advantage of the dynamically scaled executors, the rest of the spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.






[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Mirica updated SPARK-12147:
-
Attachment: spark-defaults.conf

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set up to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash." where with crash I also assimilate Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially over yarn where executors use 
> up compute "slots" even if idle. After a long, expensive, computation where 
> we take advantage of the dynamically scaled executors, the rest of the spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.






[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041679#comment-15041679
 ] 

Rares Mirica commented on SPARK-12147:
--

Sorry, I wasn't specific enough about the use-case and how to trigger/take 
advantage of this.

There is no need to cache data in the traditional sense (by calling .cache() on 
the RDD), so no on-heap space is required. One only needs to append 
.persist(OFF_HEAP) after the computation to take advantage of this. All of the 
data should therefore reside in off-heap storage (for the time being this is 
Tachyon). There is no alternative off-heap implementation, so Tachyon is 
required to take advantage of this; the only other option would be to serialise 
the result of the expensive computation to disk (through a .saveX call) and then 
re-load the RDD through sparkContext.textFile (or an equivalent using Parquet or 
Java-serialised objects).

The data should only live in one place, Tachyon, and should be considered 
persisted (as it would be by serialising and saving to HDFS) for the lifetime 
of the application. If that were the case, the death or decommission of an 
executor would be completely decoupled from the data originating in that 
executor and "cached" in Tachyon.
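
A minimal sketch of the usage pattern being described, with the storage level spelled out (`sc` is an existing SparkContext; the path and the computation are placeholders):

{code}
import org.apache.spark.storage.StorageLevel

// Sketch: the result of an expensive computation is kept only in the external
// (Tachyon-backed) off-heap store, with no .cache() and no on-heap copy.
val expensive = sc.textFile("hdfs:///data/input").map(line => line.length.toLong)
expensive.persist(StorageLevel.OFF_HEAP)
expensive.count()   // materializes the off-heap copy

// Later jobs reuse `expensive`; the question raised here is whether that copy
// survives executors being decommissioned by dynamic allocation.
{code}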

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set up to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash." where with crash I also assimilate Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially over yarn where executors use 
> up compute "slots" even if idle. After a long, expensive, computation where 
> we take advantage of the dynamically scaled executors, the rest of the spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.






[jira] [Assigned] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12026:


Assignee: Apache Spark

> ChiSqTest gets slower and slower over time when number of features is large
> ---
>
> Key: SPARK-12026
> URL: https://issues.apache.org/jira/browse/SPARK-12026
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Hunter Kelly
>Assignee: Apache Spark
>  Labels: mllib, stats
> Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My 
> understanding is that internally it creates jobs to run on batches of 1000 
> features at a time.
> I was under the impression that the features are treated as independent, but 
> this does not appear to be the case.  When the number of features is large 
> (160k in my case), each batch gets slower and slower.  As an example, running 
> on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch.  
> By the end, batches were taking over 30 minutes each.
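
For context, a minimal sketch of the call in question (the tiny two-row dataset is a stand-in for the real 160k-feature data; `sc` is an existing SparkContext):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Sketch: chiSqTest over an RDD[LabeledPoint] returns one ChiSqTestResult per
// feature; per the description, features are processed internally in batches.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 1.0))
))

val results = Statistics.chiSqTest(data)
results.foreach(println)   // statistic, degrees of freedom and p-value per feature
{code}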






[jira] [Assigned] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12026:


Assignee: (was: Apache Spark)

> ChiSqTest gets slower and slower over time when number of features is large
> ---
>
> Key: SPARK-12026
> URL: https://issues.apache.org/jira/browse/SPARK-12026
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Hunter Kelly
>  Labels: mllib, stats
> Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My 
> understanding is that internally it creates jobs to run on batches of 1000 
> features at a time.
> I was under the impression that the features are treated as independent, but 
> this does not appear to be the case.  When the number of features is large 
> (160k in my case), each batch gets slower and slower.  As an example, running 
> on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch.  
> By the end, batches were taking over 30 minutes each.






[jira] [Resolved] (SPARK-7999) Graph complement function in GraphX

2015-12-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7999.
--
Resolution: Won't Fix

> Graph complement function in GraphX
> ---
>
> Key: SPARK-7999
> URL: https://issues.apache.org/jira/browse/SPARK-7999
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX
>Reporter: Tarek Auel
>Priority: Minor
>
> This task is for implementing the complement operation (compare to parent 
> task).
> http://techieme.in/complex-graph-operations/






[jira] [Resolved] (SPARK-7894) Graph Union Operator

2015-12-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7894.
--
Resolution: Won't Fix

> Graph Union Operator
> 
>
> Key: SPARK-7894
> URL: https://issues.apache.org/jira/browse/SPARK-7894
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX
>Reporter: Andy Huang
>  Labels: graph, union
> Attachments: union_operator.png
>
>
> This operator aims to union two graphs and generate a new graph directly. The 
> union of two graphs is the union of their vertex sets and their edge 
> families. Vertices and edges which are included in either graph will be part 
> of the new graph.
> bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).
> The image below shows a union of graph G and graph H
> !union_operator.png|width=600px,align=center!
> A simple interface would be:
> bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
> However, vertices and edges will inevitably overlap at the borders of the 
> graphs. For vertices, it is quite natural to just take the union and remove 
> the duplicates. But for edges, a mergeEdges function seems to be 
> more reasonable.
> bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
> (ED, ED) => ED): Graph[VD, ED]
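
As a point of reference, a rough sketch of how such a union might be expressed with existing GraphX primitives, under simplified merge semantics (duplicate vertices keep an arbitrary attribute; the helper name is illustrative):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

// Sketch only: union two graphs by unioning their vertex and edge RDDs,
// keeping one attribute per duplicate vertex and merging duplicate edges.
def unionGraphs[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  val vertices = g.vertices.union(h.vertices)
    .reduceByKey((a, _) => a)                       // arbitrary pick for duplicates
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}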






[jira] [Resolved] (SPARK-7893) Complex Operators between Graphs

2015-12-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7893.
--
Resolution: Won't Fix

> Complex Operators between Graphs
> 
>
> Key: SPARK-7893
> URL: https://issues.apache.org/jira/browse/SPARK-7893
> Project: Spark
>  Issue Type: Umbrella
>  Components: GraphX
>Reporter: Andy Huang
>  Labels: complex, graph, join, operators, union
>
> Currently there are 30+ operators in GraphX, but few of them are operators 
> between graphs. The only one is _*mask*_, which takes another graph 
> as a parameter and returns a new graph.
> In many complex cases, such as _*streaming graphs or merging a small graph into 
> a huge graph*_, higher-level operators between graphs can help users focus and 
> think in terms of graphs. Performance optimization can be done internally and 
> be transparent to them.
> The complex graph operator list is 
> here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]
> * Union of Graphs ( G ∪ H )
> * Intersection of Graphs( G ∩ H)
> * Graph Join
> * Difference of Graphs(G – H)
> * Graph Complement
> * Line Graph ( L(G) )
> This issue will be the index of all these operators.
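
For orientation, a minimal sketch of the one existing cross-graph operator mentioned above (`sc` is an existing SparkContext; the tiny graphs are illustrative only):

{code}
import org.apache.spark.graphx.{Edge, Graph}

// Sketch: mask restricts a graph to the vertices and edges that also exist in
// another graph, keeping the first graph's attributes.
val g = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))),
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1))))

val h = g.subgraph(vpred = (id, _) => id != 3L)   // drop vertex 3 and its edges

val restricted = g.mask(h)                        // keeps only what is also in h
{code}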






[jira] [Commented] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041620#comment-15041620
 ] 

Apache Spark commented on SPARK-12026:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10146

> ChiSqTest gets slower and slower over time when number of features is large
> ---
>
> Key: SPARK-12026
> URL: https://issues.apache.org/jira/browse/SPARK-12026
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Hunter Kelly
>  Labels: mllib, stats
> Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My 
> understanding is that internally it creates jobs to run on batches of 1000 
> features at a time.
> I was under the impression that the features are treated as independent, but 
> this does not appear to be the case.  When the number of features is large 
> (160k in my case), each batch gets slower and slower.  As an example, running 
> on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch.  
> By the end, batches were taking over 30 minutes each.






[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041789#comment-15041789
 ] 

Rares Mirica commented on SPARK-12147:
--

Yes, I am talking about the executor stopping as part of scaling down under
dynamic allocation. I am observing this in an actual test; I was reading
the docs just to check my assumption.



> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set up to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash." where with crash I also assimilate Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially over yarn where executors use 
> up compute "slots" even if idle. After a long, expensive, computation where 
> we take advantage of the dynamically scaled executors, the rest of the spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.






[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2015-12-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041820#comment-15041820
 ] 

Marcelo Vanzin commented on SPARK-12140:


You don't need to ask permission to work on things.

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.






[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration

2015-12-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041817#comment-15041817
 ] 

Marcelo Vanzin commented on SPARK-10123:


Bug is not assigned to anyone, so no need to ask permission...

> Cannot set "--deploy-mode" in default configuration
> ---
>
> Key: SPARK-10123
> URL: https://issues.apache.org/jira/browse/SPARK-10123
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's no configuration option that is the equivalent of "--deploy-mode". So 
> it's not possible, for example, to have applications be submitted in 
> standalone cluster mode by default - you have to always use the command line 
> argument for that.
> YARN is special because it has the (somewhat deprecated) "yarn-cluster" 
> master, but it would be nice to be consistent and have a proper config option 
> for this.






[jira] [Comment Edited] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-12-04 Thread Jiri Syrovy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742
 ] 

Jiri Syrovy edited comment on SPARK-8513 at 12/4/15 4:47 PM:
-

The same thing happens when writing partitioned files with coalesce(1) and 
spark.speculation set to "false". I've tried multiple different Hadoop versions 
(2.6 -> 2.7.1) and multiple write modes (Overwrite, Append, ...), but the result 
is almost always the same.
{code:java}
DataFrameWriter writer = df.coalesce(1).write().format("parquet");
writer = perTemplate
        ? writer.partitionBy("templateId", "definitionId")
        : writer.partitionBy("definitionId");
writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location);
{code}

{noformat}
[2015-12-04 16:14:59,821] WARN  .apache.hadoop.fs.FileUtil [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or 
dir 
[/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]:
 it still exists.
[2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918;
 isDirectory=true; modification_time=1449245693000; access_time=0; owner=; 
group=; permission=rwxrwxrwx; isSymlink=false} to 
file:/data/build/result_12349.parquet.stats/templateId=2918
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
{noformat}


was (Author: xjrk):
The same thing happens when writing partitioned files with coalesce(1) and 
spark.speculation set to "false":
{code:java}
DataFrameWriter writer = 
df.coalesce(1).write().format("parquet");
writer = perTemplate ? writer.partitionBy("templateId", 
"definitionId") 
: writer.partitionBy("definitionId");
writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + 
location);
{code}

{noformat}
[2015-12-04 16:14:59,821] WARN  .apache.hadoop.fs.FileUtil [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or 
dir 
[/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]:
 it still exists.
[2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918;
 isDirectory=true; modification_time=1449245693000; access_time=0; owner=; 
group=; permission=rwxrwxrwx; isSymlink=false} to 
file:/data/build/result_12349.parquet.stats/templateId=2918
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
{noformat}

> _temporary may be left undeleted when a write job committed with 
> FileOutputCommitter fails due to a race condition
> --
>
> Key: SPARK-8513
> URL: https://issues.apache.org/jira/browse/SPARK-8513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Cheng Lian
>
> To reproduce this issue, we need a node with a relatively large number of cores, say 32 
> (e.g., the Spark Jenkins builder is a good candidate). With such a node, the 
> following code should make it relatively easy to reproduce this issue:
> {code}
> 

[jira] [Created] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-04 Thread Michael Lawrence (JIRA)
Michael Lawrence created SPARK-12148:


 Summary: SparkR: rename DataFrame to SparkDataFrame
 Key: SPARK-12148
 URL: https://issues.apache.org/jira/browse/SPARK-12148
 Project: Spark
  Issue Type: Wish
  Components: R
Reporter: Michael Lawrence
Priority: Minor


The SparkR package represents a Spark DataFrame with the class "DataFrame". 
That conflicts with the more general DataFrame class defined in the S4Vectors 
package. Would it not be more appropriate to use the name "SparkDataFrame" 
instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041683#comment-15041683
 ] 

Rares Mirica commented on SPARK-12147:
--

I would also like to object to setting this as minor; this would be a massive 
improvement in the usability of Spark in multi-tenant or 
interactive-use environments, where a large number of executors is needed to 
prepare an RDD for later use (e.g. exploratory research) and caching is needed 
to avoid wasting resources.

The only alternative is to permanently persist the RDD, the API for which is 
quite a bit more complicated and also puts the responsibility for cleaning and 
maintaining the data on the shoulders of the user (instead of treating the data 
as ephemeral and only available for the lifetime of the current application).
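
For illustration, a minimal sketch (Scala) of the two options contrasted above - caching 
off-heap for the lifetime of the application versus permanently persisting the data and 
managing it yourself. The input path, the transform and the Tachyon URI are hypothetical 
placeholders.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// sc is the application's SparkContext; the input and the map() stand in for the
// long, expensive computation described above.
def cacheForLaterUse(sc: SparkContext): Unit = {
  val prepared = sc.textFile("hdfs:///data/events")
    .map(_.toUpperCase)
    .persist(StorageLevel.OFF_HEAP) // Tachyon-backed off-heap storage in Spark 1.5

  prepared.count() // materialize the cache while the scaled-up executors are still around

  // The heavier alternative: write the data out and manage its lifecycle manually.
  prepared.saveAsObjectFile("tachyon://tachyon-master:19998/tmp/prepared-rdd")
}
{code}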

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where by "crash" I also include Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially on YARN, where executors use 
> up compute "slots" even if idle. After a long, expensive computation that 
> takes advantage of the dynamically scaled executors, the rest of the Spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-12-04 Thread Jiri Syrovy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742
 ] 

Jiri Syrovy commented on SPARK-8513:


The same thing happens when writing partitioned files with coalesce(1) and 
spark.speculation set to "false":
{code:java}
DataFrameWriter writer = 
df.coalesce(1).write().format("parquet");
writer = perTemplate ? writer.partitionBy("templateId", 
"definitionId") 
: writer.partitionBy("definitionId");
writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + 
location);
{code}

{noformat}
[2015-12-04 16:14:59,821] WARN  .apache.hadoop.fs.FileUtil [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or 
dir 
[/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]:
 it still exists.
[2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918;
 isDirectory=true; modification_time=1449245693000; access_time=0; owner=; 
group=; permission=rwxrwxrwx; isSymlink=false} to 
file:/data/build/result_12349.parquet.stats/templateId=2918
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
{noformat}

> _temporary may be left undeleted when a write job committed with 
> FileOutputCommitter fails due to a race condition
> --
>
> Key: SPARK-8513
> URL: https://issues.apache.org/jira/browse/SPARK-8513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Cheng Lian
>
> To reproduce this issue, we need a node with a relatively large number of cores, say 32 
> (e.g., the Spark Jenkins builder is a good candidate). With such a node, the 
> following code should make it relatively easy to reproduce this issue:
> {code}
> sqlContext.range(0, 10).repartition(32).select('id / 
> 0).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> You may observe similar log lines as below:
> {noformat}
> 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
> WARN FileUtil: Failed to delete file or dir 
> [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
>  it still exists.
> {noformat}
> The reason is that, for a Spark job with multiple tasks, when a task fails 
> after multiple retries, the job gets canceled on driver side.  At the same 
> time, all child tasks of this job also get canceled.  However, task 
> cancelation is asynchronous.  This means, some tasks may still be running 
> when the job is already killed on driver side.
> With this in mind, the following execution order may cause the log line 
> mentioned above:
> # Job {{A}} spawns 32 tasks to write the Parquet file
>   Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
> temporary directory {{D1}} is created to hold output files of different task 
> attempts.
> # Task {{a1}} fails after several retries first because of the division by 
> zero error
> # Task {{a1}} aborts the Parquet write task and tries to remove its task 
> attempt output directory {{d1}} (a sub-directory of {{D1}})
> # Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
> canceled *asynchronously*
> # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
> removing all its child files/directories first.
>   Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
> calls {{java.io.File.delete()}} to perform the deletion, and only empty directories can 
> be deleted.
> # Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
> just get scheduled and create its own task attempt directory {{d2}} under 
> {{D1}}
> # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
> itself, but fails because {{d2}} makes {{D1}} non-empty again
> Notice that this bug affects 

[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041740#comment-15041740
 ] 

Sean Owen commented on SPARK-12147:
---

To be honest, the priority doesn't matter much at all. It won't make anyone 
work on it. You're talking about a feature that's not generally used in Spark 
(Tachyon) and may be moving out of the core, so I dunno, it doesn't seem 
"Major".

I understand that this is about RDDs that are cached off-heap only now. You're 
talking about an executor being shut down, not the app terminating, right? I 
would not think an executor stopping changes off-heap cached data, but I don't 
know the details -- maybe there is a locality aspect to it, and if there is 
then that's the reason its fate is like that of on-heap storage. 

To double-check, are you saying you observe this happening, or is this just your 
reading of the docs? The passage you quote is about something else as far as I 
can tell.

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out to 
> test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming guide for Spark 1.5.2) I 
> was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where by "crash" I also include Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document 
> cached data preservation through off-heap storage is also hinted at.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, enabling the preservation of the off-heap persisted data 
> after graceful decommission for dynamic allocation would yield significant 
> improvements in resource allocation, especially on YARN, where executors use 
> up compute "slots" even if idle. After a long, expensive computation that 
> takes advantage of the dynamically scaled executors, the rest of the Spark 
> jobs can use the cached data while releasing the compute resources for other 
> cluster tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-12-04 Thread Jiri Syrovy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742
 ] 

Jiri Syrovy edited comment on SPARK-8513 at 12/4/15 4:45 PM:
-

The same thing happens when writing partitioned files with coalesce(1) and 
spark.speculation set to "false":
{code:java}
DataFrameWriter writer = 
df.coalesce(1).write().format("parquet");
writer = perTemplate ? writer.partitionBy("templateId", 
"definitionId") 
: writer.partitionBy("definitionId");
writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + 
location);
{code}

{noformat}
[2015-12-04 16:14:59,821] WARN  .apache.hadoop.fs.FileUtil [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or 
dir 
[/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]:
 it still exists.
[2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918;
 isDirectory=true; modification_time=1449245693000; access_time=0; owner=; 
group=; permission=rwxrwxrwx; isSymlink=false} to 
file:/data/build/result_12349.parquet.stats/templateId=2918
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
{noformat}


was (Author: xjrk):
The same thing happens when writing partitioned files with coalesce(1) and 
spark.speculation set to "false":
{code:java}
DataFrameWriter writer = 
df.coalesce(1).write().format("parquet");
writer = perTemplate ? writer.partitionBy("templateId", 
"definitionId") 
: writer.partitionBy("definitionId");
writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + 
location);
{code}

{noformat}
[2015-12-04 16:14:59,821] WARN  .apache.hadoop.fs.FileUtil [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or 
dir 
[/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]:
 it still exists.
[2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] 
[akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918;
 isDirectory=true; modification_time=1449245693000; access_time=0; owner=; 
group=; permission=rwxrwxrwx; isSymlink=false} to 
file:/data/build/result_12349.parquet.stats/templateId=2918
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
{noformat}

> _temporary may be left undeleted when a write job committed with 
> FileOutputCommitter fails due to a race condition
> --
>
> Key: SPARK-8513
> URL: https://issues.apache.org/jira/browse/SPARK-8513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Cheng Lian
>
> To reproduce this issue, we need a node with a relatively large number of cores, say 32 
> (e.g., the Spark Jenkins builder is a good candidate). With such a node, the 
> following code should make it relatively easy to reproduce this issue:
> {code}
> sqlContext.range(0, 10).repartition(32).select('id / 
> 0).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> You may observe similar log 

[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR

2015-12-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042103#comment-15042103
 ] 

Felix Cheung commented on SPARK-12144:
--

+1 [~shivaram]

The style
{code}
read.format("json").option(options).load("people.json")
{code}

doesn't really fit R.

Instead, why don't we have:
{code}
read(format, options_named_list, ...)
{code}

In fact, we could keep or add convenience functions like:
{code}
read.csv(options)
read.json(options)
read.parquet(options)
{code}



> Implement DataFrameReader and DataFrameWriter API in SparkR
> ---
>
> Key: SPARK-12144
> URL: https://issues.apache.org/jira/browse/SPARK-12144
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> DataFrameReader API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
> DataFrameWriter API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12038) Add support for pickle protocol to graphite sink

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042124#comment-15042124
 ] 

Apache Spark commented on SPARK-12038:
--

User 'alberskib' has created a pull request for this issue:
https://github.com/apache/spark/pull/10148

> Add support for pickle protocol to graphite sink
> 
>
> Key: SPARK-12038
> URL: https://issues.apache.org/jira/browse/SPARK-12038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bartlomiej Alberski
>Priority: Minor
>
> Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not 
> guarantee that Graphite will receive metrics. On the other hand, using the TCP 
> sender could be problematic when several jobs are sending a lot of 
> different metrics at high frequency - it could lead to network congestion.
> I think that implementing support for the pickle protocol for sending metrics to 
> Graphite could be helpful in minimizing network traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12038) Add support for pickle protocol to graphite sink

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12038:


Assignee: (was: Apache Spark)

> Add support for pickle protocol to graphite sink
> 
>
> Key: SPARK-12038
> URL: https://issues.apache.org/jira/browse/SPARK-12038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bartlomiej Alberski
>Priority: Minor
>
> Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not 
> guarantee that Graphite will receive metrics. On the other hand, using the TCP 
> sender could be problematic when several jobs are sending a lot of 
> different metrics at high frequency - it could lead to network congestion.
> I think that implementing support for the pickle protocol for sending metrics to 
> Graphite could be helpful in minimizing network traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12038) Add support for pickle protocol to graphite sink

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12038:


Assignee: Apache Spark

> Add support for pickle protocol to graphite sink
> 
>
> Key: SPARK-12038
> URL: https://issues.apache.org/jira/browse/SPARK-12038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bartlomiej Alberski
>Assignee: Apache Spark
>Priority: Minor
>
> Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not 
> guarantee that Graphite will receive metrics. On the other hand, using the TCP 
> sender could be problematic when several jobs are sending a lot of 
> different metrics at high frequency - it could lead to network congestion.
> I think that implementing support for the pickle protocol for sending metrics to 
> Graphite could be helpful in minimizing network traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11944:


Assignee: Apache Spark

> Python API for mllib.clustering.BisectingKMeans
> ---
>
> Key: SPARK-11944
> URL: https://issues.apache.org/jira/browse/SPARK-11944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add Python API for mllib.clustering.BisectingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11944:


Assignee: (was: Apache Spark)

> Python API for mllib.clustering.BisectingKMeans
> ---
>
> Key: SPARK-11944
> URL: https://issues.apache.org/jira/browse/SPARK-11944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add Python API for mllib.clustering.BisectingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042339#comment-15042339
 ] 

Apache Spark commented on SPARK-11944:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10150

> Python API for mllib.clustering.BisectingKMeans
> ---
>
> Key: SPARK-11944
> URL: https://issues.apache.org/jira/browse/SPARK-11944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add Python API for mllib.clustering.BisectingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042381#comment-15042381
 ] 

Apache Spark commented on SPARK-12152:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10151

> Speed up Scalastyle by only running one SBT command instead of four
> ---
>
> Key: SPARK-12152
> URL: https://issues.apache.org/jira/browse/SPARK-12152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> dev/scalastyle runs four SBT commands when only one would suffice. We should 
> fix this in order to speed up pull request builds by about 60 seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12152:


Assignee: Apache Spark  (was: Josh Rosen)

> Speed up Scalastyle by only running one SBT command instead of four
> ---
>
> Key: SPARK-12152
> URL: https://issues.apache.org/jira/browse/SPARK-12152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> dev/scalastyle runs four SBT commands when only one would suffice. We should 
> fix this in order to speed up pull request builds by about 60 seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1832) Executor UI improvement suggestions

2015-12-04 Thread Alexander Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041936#comment-15041936
 ] 

Alexander Bozarth edited comment on SPARK-1832 at 12/4/15 8:49 PM:
---

I've split off a subtask for the Color improvements that I will be submitting a 
PR for shortly.

Also what do you mean by "the MASTER task" in the description?

I will also continue to work on the column totals task.


was (Author: ajbozarth):
I've split off a subtask for the Color improvements that I will be submitting a 
PR for shortly.

Also what do you mean by "the MASTER task" in the description?

> Executor UI improvement suggestions
> ---
>
> Key: SPARK-1832
> URL: https://issues.apache.org/jira/browse/SPARK-1832
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> - Fill some of the cells with color in order to make it easier to absorb 
> the info, e.g.:
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed))
> - if dark blue then write the value in white (same for the RED and GREEN above)
> Maybe mark the MASTER task somehow
>
> - Report the TOTALS in each column (do this at the TOP so no need to scroll 
> to the bottom, or print both at top and bottom).
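
For illustration, a minimal sketch (Scala) of the kind of color coding suggested above; 
the helper name, the thresholds and the RGB values are made up for the example.

{code}
import scala.math.log1p

// Hypothetical helper: map a completed-task count to a shade of blue whose intensity
// grows with log(# completed), switching to white text once the background is dark.
def completedCellStyle(completed: Long, maxCompleted: Long): String = {
  val intensity =
    if (maxCompleted <= 0L) 0.0
    else log1p(completed.toDouble) / log1p(maxCompleted.toDouble)
  val shade = (230 - 130 * intensity).toInt // 230 (light blue) down to 100 (dark blue)
  val textColor = if (intensity > 0.6) "white" else "black"
  s"background-color: rgb($shade, $shade, 255); color: $textColor"
}
{code}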



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6990) Add Java linting script

2015-12-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6990.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Fixed by https://github.com/apache/spark/pull/9867

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12152:


Assignee: Josh Rosen  (was: Apache Spark)

> Speed up Scalastyle by only running one SBT command instead of four
> ---
>
> Key: SPARK-12152
> URL: https://issues.apache.org/jira/browse/SPARK-12152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> dev/scalastyle runs four SBT commands when only one would suffice. We should 
> fix this in order to speed up pull request builds by about 60 seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12150:


Assignee: (was: Apache Spark)

> numPartitions argument to sqlContext.range()  should be optional
> 
>
> Key: SPARK-12150
> URL: https://issues.apache.org/jira/browse/SPARK-12150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>Priority: Minor
>
> It's a little inconsistent that the first two sqlContext.range() methods 
> don't take a numPartitions arg, while the third one does. 
> And more importantly, it's a little inconvenient that the numPartitions arg 
> is mandatory for the third range() method - it means that if you want to 
> specify a step, you suddenly have to think about partitioning - an orthogonal 
> concern.
> My suggestion would be to make numPartitions optional, like it is on the 
> sparkContext.range(..).
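
For illustration, a minimal sketch (Scala) of the suggestion, written as a hypothetical 
helper on top of the existing three-argument range(); defaulting to the SparkContext's 
default parallelism is only an assumption about a sensible fallback.

{code}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical wrapper: numPartitions becomes optional and falls back to the
// default parallelism when the caller only cares about start, end and step.
def rangeWithOptionalPartitions(
    sqlContext: SQLContext,
    start: Long,
    end: Long,
    step: Long = 1L,
    numPartitions: Option[Int] = None): DataFrame = {
  sqlContext.range(start, end, step,
    numPartitions.getOrElse(sqlContext.sparkContext.defaultParallelism))
}

// A step can now be given without thinking about partitioning:
// val df = rangeWithOptionalPartitions(sqlContext, 0L, 100L, step = 5L)
{code}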



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042194#comment-15042194
 ] 

Apache Spark commented on SPARK-12150:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10149

> numPartitions argument to sqlContext.range()  should be optional
> 
>
> Key: SPARK-12150
> URL: https://issues.apache.org/jira/browse/SPARK-12150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>Priority: Minor
>
> It's a little inconsistent that the first two sqlContext.range() methods 
> don't take a numPartitions arg, while the third one does. 
> And more importantly, it's a little inconvenient that the numPartitions arg 
> is mandatory for the third range() method - it means that if you want to 
> specify a step, you suddenly have to think about partitioning - an orthogonal 
> concern.
> My suggestion would be to make numPartitions optional, like it is on the 
> sparkContext.range(..).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12150:


Assignee: Apache Spark

> numPartitions argument to sqlContext.range()  should be optional
> 
>
> Key: SPARK-12150
> URL: https://issues.apache.org/jira/browse/SPARK-12150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>Assignee: Apache Spark
>Priority: Minor
>
> It's a little inconsistent that the first two sqlContext.range() methods 
> don't take a numPartitions arg, while the third one does. 
> And more importantly, it's a little inconvenient that the numPartitions arg 
> is mandatory for the third range() method - it means that if you want to 
> specify a step, you suddenly have to think about partitioning - an orthogonal 
> concern.
> My suggestion would be to make numPartitions optional, like it is on the 
> sparkContext.range(..).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11255:
---
Assignee: shane knapp

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test 
> build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four

2015-12-04 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12152:
--

 Summary: Speed up Scalastyle by only running one SBT command 
instead of four
 Key: SPARK-12152
 URL: https://issues.apache.org/jira/browse/SPARK-12152
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen


dev/scalastyle runs four SBT commands when only one would suffice. We should 
fix this in order to speed up pull request builds by about 60 seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12058) Fix KinesisStreamTests in python/pyspark/streaming/tests.py and enable it

2015-12-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-12058.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Fix KinesisStreamTests in python/pyspark/streaming/tests.py and enable it
> -
>
> Key: SPARK-12058
> URL: https://issues.apache.org/jira/browse/SPARK-12058
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> KinesisStreamTests is disabled to unblock other PRs. After fixing it, we 
> should enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12138:
-
Summary: Escape \u in the generated comments.  (was: [SPARK-11352][SQL] 
Escape \u in the generated comments.)

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes

2015-12-04 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042260#comment-15042260
 ] 

shane knapp commented on SPARK-11823:
-

it looks like it's this particular test that's hanging (and leaving hanging 
processes):

[info] - test jdbc cancel *** FAILED *** (1 hour, 27 minutes, 37 seconds)

i've seen this process hanging for over 3 hours as well.

> HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
> --
>
> Key: SPARK-11823
> URL: https://issues.apache.org/jira/browse/SPARK-11823
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: shane knapp
> Attachments: 
> spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out,
>  stack.log
>
>
> i've noticed on a few branches that the HiveThriftBinaryServerSuite tests 
> time out, and when that happens, the build is aborted but the tests leave 
> behind hanging processes that eat up cpu and ram.
> most recently, i discovered this happening w/the 1.6 SBT build, specifically 
> w/the hadoop 2.0 profile:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console
> [~vanzin] grabbed the jstack log, which i've attached to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12148:
-
Component/s: SparkR

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Wish
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Minor
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12151) Improve PySpark MLLib prediction performance when using pickled vectors

2015-12-04 Thread holdenk (JIRA)
holdenk created SPARK-12151:
---

 Summary: Improve PySpark MLLib prediction performance when using 
pickled vectors
 Key: SPARK-12151
 URL: https://issues.apache.org/jira/browse/SPARK-12151
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: holdenk
Priority: Minor


In a number of places inside PySpark MLlib, when calling predict on an RDD, we 
map the Python prediction function over the RDD. Instead, we could convert the 
RDD to an RDD of pickled Vectors and then use the Java prediction function. 
This would be useful for models that have optimized prediction on batches of 
objects (e.g. by broadcasting the relevant parts of the model or similar).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional

2015-12-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042156#comment-15042156
 ] 

Xiao Li commented on SPARK-12150:
-

Yeah, you are right. Will deliver a PR soon. Thanks!

> numPartitions argument to sqlContext.range()  should be optional
> 
>
> Key: SPARK-12150
> URL: https://issues.apache.org/jira/browse/SPARK-12150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>Priority: Minor
>
> It's a little inconsistent that the first two sqlContext.range() methods 
> don't take a numPartitions arg, while the third one does. 
> And more importantly, it's a little inconvenient that the numPartitions arg 
> is mandatory for the third range() method - it means that if you want to 
> specify a step, you suddenly have to think about partitioning - an orthogonal 
> concern.
> My suggestion would be to make numPartitions optional, like it is on the 
> sparkContext.range(..).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-04 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042482#comment-15042482
 ] 

Evan Chen commented on SPARK-10931:
---

Hey Joseph,

Just wanted to give a status update. I should be submitting a PR soon.

Thanks,

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12142) Can't request executor when container allocator is not ready

2015-12-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12142.

   Resolution: Fixed
 Assignee: meiyoula
Fix Version/s: 2.0.0

> Can't request executor when container allocator is not ready
> 
>
> Key: SPARK-12142
> URL: https://issues.apache.org/jira/browse/SPARK-12142
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Assignee: meiyoula
> Fix For: 2.0.0
>
>
> When the dynamic allocation function is used and a new AM is starting, the 
> ExecutorAllocationManager may send a RequestExecutor message to the AM. If the 
> container allocator is not ready, the whole app will hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042492#comment-15042492
 ] 

Xiao Li commented on SPARK-12138:
-

If nobody takes it, I can give it a try. [~yhuai] Could you explain how to 
reproduce it? I did not hit this issue when running the tests. Thanks!

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12080) Kryo - Support multiple user registrators

2015-12-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12080.

   Resolution: Fixed
 Assignee: Rotem
Fix Version/s: 2.0.0

> Kryo - Support multiple user registrators
> -
>
> Key: SPARK-12080
> URL: https://issues.apache.org/jira/browse/SPARK-12080
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Rotem
>Assignee: Rotem
>Priority: Minor
>  Labels: kryo, registrator, serializers
> Fix For: 2.0.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Background: Currently when users need to have a custom serializer for their 
> registered classes, they use the user registrator of Kryo using the 
> spark.kryo.registrator configuration parameter.
> Problem: If the Spark user is itself an infrastructure component, it may receive 
> multiple such registrators but won't be able to register them.
> Important note: Currently the single supported registrator can't reach any 
> state/configuration (it is instantiated by reflection with an empty constructor).
> Using SparkEnv from user code isn't acceptable.
> Workaround:
> Create a wrapper registrator as a user, and have its implementation scan the 
> class path for multiple classes. 
> Caveat: this is inefficient and too complicated.
> Suggested solution - support multiple registrators + stay backward compatible
> Option 1:
> enhance the value of spark.kryo.registrator to support a comma-separated 
> list of class names. This will be backward compatible and won't add new 
> parameters. 
> Option 2:
> to be more logical, add a new spark.kryo.registrators parameter, while keeping 
> the code that handles the old one.
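
For illustration, a minimal sketch (Scala) of the wrapper-registrator workaround described 
above; the delegate class names are hypothetical and would normally come from the embedding 
infrastructure's own configuration rather than being hard-coded.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical composite registrator: a single class configured via
// spark.kryo.registrator that delegates to several inner registrators.
class CompositeRegistrator extends KryoRegistrator {
  private val delegateClassNames = Seq(
    "com.example.FooRegistrator",
    "com.example.BarRegistrator")

  override def registerClasses(kryo: Kryo): Unit = {
    delegateClassNames.foreach { name =>
      Class.forName(name).newInstance()
        .asInstanceOf[KryoRegistrator]
        .registerClasses(kryo)
    }
  }
}
{code}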



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12084) Fix codes that uses ByteBuffer.array incorrectly

2015-12-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12084.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Fix codes that uses ByteBuffer.array incorrectly
> 
>
> Key: SPARK-12084
> URL: https://issues.apache.org/jira/browse/SPARK-12084
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> ByteBuffer doesn't guarantee that all contents in `ByteBuffer.array` are valid; 
> e.g., they are not for a ByteBuffer returned by ByteBuffer.slice. We should not use the whole 
> content of `ByteBuffer.array` unless we know that's correct.
> This patch fixes all places that use `ByteBuffer.array` incorrectly.
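
For illustration, a minimal sketch (Scala) of the kind of defensive handling this implies; 
bufferToArray is a hypothetical helper, not the actual patch.

{code}
import java.nio.ByteBuffer

// Hypothetical helper: return exactly the bytes visible through the buffer's window,
// never the raw backing array of a sliced, offset or direct buffer.
def bufferToArray(buf: ByteBuffer): Array[Byte] = {
  if (buf.hasArray && buf.arrayOffset() == 0 && buf.position() == 0 &&
      buf.remaining() == buf.array().length) {
    buf.array() // safe: the buffer spans its whole backing array
  } else {
    val copy = new Array[Byte](buf.remaining())
    buf.duplicate().get(copy) // duplicate() so the original position is left untouched
    copy
  }
}
{code}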



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding

2015-12-04 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042516#comment-15042516
 ] 

Matt Cheah commented on SPARK-11081:


Upgrading to Jersey 2 definitely sounds more reasonable. Perhaps we can discard 
this ticket and just open an "Upgrade to Jersey 2" ticket targeting Spark 2.0. 
What does everyone think?

> Make spark-core pull in Jersey and javax.ws.rs dependencies separately for 
> easier overriding
> 
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.
> There was an in-depth discussion on options for shading and making it easier 
> for users to be able to use Jersey 2 with Spark applications: 
> https://github.com/apache/spark/pull/9615
> To recap the discussion, Jersey 1 has two issues:
> 1. It has classes listed in META-INF/services/ files that would be loaded 
> even if Jersey 2 was being loaded on the classpath in a higher precedence. 
> This means that Jersey 2 would attempt to use Jersey 1 implementations in 
> some places regardless of user attempts to override the dependency with 
> things like userClassPathFirst.
> 2. Jersey 1 packages javax.ws.rs classes inside itself, making it hard to 
> exclude just javax.ws.rs APIs and replace them with ones that Jersey 2 is 
> compatible with.
> Also discussed was the fact that plain old shading doesn't work here, since 
> you would need to shade lines in META-INF/services as well, not just classes. 
> Not to mention that shading JAX-RS annotations is tricky as well.
> To recap the discussion in terms of what needs to happen on the Spark side, we need to:
> 1. Create a "org.spark-project.jersey" artifact (loosely speaking) which is 
> the Jersey 1 jar minus all the javax.ws.rs stuff (no need to actually 
> shade/namespace the classes that way, just the artifact name)
> 2. Put all the javax.ws.rs stuff extracted from step 1 into its own artifact, 
> say "org.spark-project.javax.ws.rs". (META-INF/services/javax.ws.rs* files 
> live in this artifact as well)
> 3. Spark-core's pom depends on org.spark-project artifacts from step 1 and 2
> 4. Spark assembly excludes META-INF/services/javax.ws.rs.* - it turns out 
> these files aren't actually necessary for Jersey 1 to function properly in 
> general (we need to test this more however)
> Now a user that wants to depend on Jersey 2, and is depending on Spark maven 
> artifacts, would do the following in their application
> 1. Provide my own dependency on Jersey 2 and its transitive javax.ws.rs 
> dependencies
> 2. In my application's dependencies, exclude org.spark-project.javax.ws.rs 
> from spark-core. We keep org.spark-project.jersey because spark-core needs 
> it, but it will use the javax.ws.rs classes that my application is providing.
> 3. Set spark.executor.userClassPathFirst=true and ship Jersey 2 and new 
> javax.ws.rs jars to the executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding

2015-12-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042519#comment-15042519
 ] 

Marcelo Vanzin commented on SPARK-11081:


Opening a new one or re-purposing this one, either is fine.

> Make spark-core pull in Jersey and javax.ws.rs dependencies separately for 
> easier overriding
> 
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.
> There was an in-depth discussion on options for shading and making it easier 
> for users to be able to use Jersey 2 with Spark applications: 
> https://github.com/apache/spark/pull/9615
> To recap the discussion, Jersey 1 has two issues:
> 1. It has classes listed in META-INF/services/ files that would be loaded 
> even if Jersey 2 was being loaded on the classpath in a higher precedence. 
> This means that Jersey 2 would attempt to use Jersey 1 implementations in 
> some places regardless of user attempts to override the dependency with 
> things like userClassPathFirst.
> 2. Jersey 1 packages javax.ws.rs classes inside itself, making it hard to 
> exclude just javax.ws.rs APIs and replace them with ones that Jersey 2 is 
> compatible with.
> Also discussed was the fact that plain old shading doesn't work here, since 
> you would need to shade lines in META-INF/services as well, not just classes. 
> Not to mention that shading JAX-RS annotations is tricky as well.
> To recap the discussion as to what needs to happen Spark-side, we need to:
> 1. Create a "org.spark-project.jersey" artifact (loosely speaking) which is 
> the Jersey 1 jar minus all the javax.ws.rs stuff (no need to actually 
> shade/namespace the classes that way, just the artifact name)
> 2. Put all the javax.ws.rs stuff extracted from step 1 into its own artifact, 
> say "org.spark-project.javax.ws.rs". (META-INF/services/javax.ws.rs* files 
> live in this artifact as well)
> 3. Spark-core's pom depends on org.spark-project artifacts from step 1 and 2
> 4. Spark assembly excludes META-INF/services/javax.ws.rs.* - it turns out 
> these files aren't actually necessary for Jersey 1 to function properly in 
> general (we need to test this more however)
> Now a user that wants to depend on Jersey 2, and is depending on Spark maven 
> artifacts, would do the following in their application
> 1. Provide my own dependency on Jersey 2 and its transitive javax.ws.rs 
> dependencies
> 2. In my application's dependencies, exclude org.spark-project.javax.ws.rs 
> from spark-core. We keep org.spark-project.jersey because spark-core needs 
> it, but it will use the javax.ws.rs classes that my application is providing.
> 3. Set spark.executor.userClassPathFirst=true and ship Jersey 2 and new 
> javax.ws.rs jars to the executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding

2015-12-04 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-11081.

Resolution: Not A Problem

Ok, I filed SPARK-12154 to track it.

> Make spark-core pull in Jersey and javax.ws.rs dependencies separately for 
> easier overriding
> 
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.
> There was an in-depth discussion on options for shading and making it easier 
> for users to be able to use Jersey 2 with Spark applications: 
> https://github.com/apache/spark/pull/9615
> To recap the discussion, Jersey 1 has two issues:
> 1. It has classes listed in META-INF/services/ files that would be loaded 
> even if Jersey 2 was being loaded on the classpath in a higher precedence. 
> This means that Jersey 2 would attempt to use Jersey 1 implementations in 
> some places regardless of user attempts to override the dependency with 
> things like userClassPathFirst.
> 2. Jersey 1 packages javax.ws.rs classes inside itself, making it hard to 
> exclude just javax.ws.rs APIs and replace them with ones that Jersey 2 is 
> compatible with.
> Also discussed was the fact that plain old shading doesn't work here, since 
> you would need to shade lines in META-INF/services as well, not just classes. 
> Not to mention that shading JAX-RS annotations is tricky as well.
> To recap the discussion as to what needs to happen Spark-side, we need to:
> 1. Create a "org.spark-project.jersey" artifact (loosely speaking) which is 
> the Jersey 1 jar minus all the javax.ws.rs stuff (no need to actually 
> shade/namespace the classes that way, just the artifact name)
> 2. Put all the javax.ws.rs stuff extracted from step 1 into its own artifact, 
> say "org.spark-project.javax.ws.rs". (META-INF/services/javax.ws.rs* files 
> live in this artifact as well)
> 3. Spark-core's pom depends on org.spark-project artifacts from step 1 and 2
> 4. Spark assembly excludes META-INF/services/javax.ws.rs.* - it turns out 
> these files aren't actually necessary for Jersey 1 to function properly in 
> general (we need to test this more however)
> Now a user that wants to depend on Jersey 2, and is depending on Spark maven 
> artifacts, would do the following in their application
> 1. Provide my own dependency on Jersey 2 and its transitive javax.ws.rs 
> dependencies
> 2. In my application's dependencies, exclude org.spark-project.javax.ws.rs 
> from spark-core. We keep org.spark-project.jersey because spark-core needs 
> it, but it will use the javax.ws.rs classes that my application is providing.
> 3. Set spark.executor.userClassPathFirst=true and ship Jersey 2 and new 
> javax.ws.rs jars to the executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12112) Upgrade to SBT 0.13.9

2015-12-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12112.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Upgrade to SBT 0.13.9
> -
>
> Key: SPARK-12112
> URL: https://issues.apache.org/jira/browse/SPARK-12112
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> We should upgrade to SBT 0.13.9, since this is a requirement in order to use 
> SBT's new Maven-style resolution features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042539#comment-15042539
 ] 

Yin Huai commented on SPARK-12138:
--

Yeah. It would be great if you can work on it. To reproduce it, you can try something 
like 
https://github.com/apache/spark/commit/5872a9d89fe2720c2bcb1fc7494136947a72581c#diff-cf187b40d98ff322d4bde4185701baa2.
Basically, you have a predicate where one of the operands is {{\u}}.
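A hypothetical reproduction sketch (not taken from the linked commit): a predicate whose 
literal operand is the two characters backslash-u, which ends up verbatim in a 
generated-code comment and can trip javac's unicode-escape handling:

{code}
// spark-shell sketch; assumes a SQLContext named sqlContext is in scope.
import org.apache.spark.sql.functions.{col, lit}

val df = sqlContext.range(1).selectExpr("cast(id as string) as s")
// The comparison's generated-code comment contains the raw backslash-u sequence.
df.filter(col("s") === lit("\\u")).collect()
{code}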

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-12-04 Thread Justin Bailey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042438#comment-15042438
 ] 

Justin Bailey commented on SPARK-8118:
--

Can this issue be re-opened? It's really terrible how much log output Parquet 
produces.

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.5.0
>
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.
> A better approach than simply muting these log lines is to redirect Parquet 
> logs via SLF4J, so that we can handle them consistently. In general these 
> logs are very useful, especially when diagnosing Parquet memory issues and 
> filter push-down.
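A minimal sketch of the SLF4J redirection idea, assuming the org.slf4j:jul-to-slf4j 
bridge is on the classpath (Parquet 1.7 logs through java.util.logging; the bridge 
forwards those records to SLF4J so the normal logging configuration applies):

{code}
import org.slf4j.bridge.SLF4JBridgeHandler

// Drop the default java.util.logging console handler, then route JUL records to SLF4J.
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()
{code}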



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-12-04 Thread Justin Bailey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042438#comment-15042438
 ] 

Justin Bailey edited comment on SPARK-8118 at 12/4/15 11:48 PM:


Can this issue be re-opened? It's really terrible how much log output Parquet 
produces. (Using spark 1.5.1, btw)


was (Author: m4dc4p):
Can this issue be re-opened? It's really terrible how much log output Parquet 
produces.

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.5.0
>
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.
> A better approach than simply muting these log lines is to redirect Parquet 
> logs via SLF4J, so that we can handle them consistently. In general these 
> logs are very useful, especially when diagnosing Parquet memory issues and 
> filter push-down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-04 Thread YongGang Cao (JIRA)
YongGang Cao created SPARK-12153:


 Summary: Word2Vec uses a fixed length for sentences which is not 
reasonable for reality, and similarity functions and fields are not accessible
 Key: SPARK-12153
 URL: https://issues.apache.org/jira/browse/SPARK-12153
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.2
Reporter: YongGang Cao
Priority: Minor


Sentence boundaries matter for the sliding window: we shouldn't train the model on a 
window that crosses sentences, and the current hard split of sentences at 100 words 
doesn't really make sense.
Also, the cosineSimilarity function is private, which makes it useless to callers. 
We may also need access to the vocabulary and word-index table, so those need 
getters.

I made changes to address the above issues and will send out a pull request for 
review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-12-04 Thread Andrew Philpot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042462#comment-15042462
 ] 

Andrew Philpot commented on SPARK-4036:
---

Hi, what is the maturity of this code? Are you interested in a tester? I have 
existing models and features for CRF++ and would love to simply migrate them to a 
native Spark implementation.

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible

2015-12-04 Thread YongGang Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YongGang Cao updated SPARK-12153:
-
Description: 
Sentence boundaries matter for the sliding window: we shouldn't train the model on a 
window that crosses sentences, and the current hard split of sentences at 100 words 
doesn't really make sense.
Also, the cosineSimilarity function is private, which makes it useless to callers. 
We may also need access to the vocabulary and word-index table, so those need 
getters.

I made changes to address the above issues.
Here is the pull request: https://github.com/apache/spark/pull/10152

  was:
Sentence boundaries matter for the sliding window: we shouldn't train the model on a 
window that crosses sentences, and the current hard split of sentences at 100 words 
doesn't really make sense.
Also, the cosineSimilarity function is private, which makes it useless to callers. 
We may also need access to the vocabulary and word-index table, so those need 
getters.

I made changes to address the above issues and will send out a pull request for 
review.


> Word2Vec uses a fixed length for sentences which is not reasonable for 
> reality, and similarity functions and fields are not accessible
> --
>
> Key: SPARK-12153
> URL: https://issues.apache.org/jira/browse/SPARK-12153
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: YongGang Cao
>Priority: Minor
>  Labels: patch
>
> Sentence boundaries matter for the sliding window: we shouldn't train the model on a 
> window that crosses sentences, and the current hard split of sentences at 100 words 
> doesn't really make sense.
> Also, the cosineSimilarity function is private, which makes it useless to callers. 
> We may also need access to the vocabulary and word-index table, so those need 
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2015-12-04 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042513#comment-15042513
 ] 

Nakul Jindal commented on SPARK-11046:
--

Hi, I am trying to look into this. 
When you say that SparkR passes a DataFrame schema from R to the JVM backend using 
regular expressions, do you mean type strings of the form

{{map<...>}}
or
{{array<...>}}?

Also, is "structField.character" the only function where this "regular 
expression" format is passed from R to the JVM (via 
"org.apache.spark.sql.api.r.SQLUtils", "createDF")?

> Pass schema from R to JVM using JSON format
> ---
>
> Key: SPARK-11046
> URL: https://issues.apache.org/jira/browse/SPARK-11046
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using 
> regular expressions. However, Spark now supports schemas in JSON format, 
> so enhance SparkR to use the JSON schema format.
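For reference, a small Scala-side sketch of the JSON representation the JVM side already 
understands (via StructType.json and DataType.fromJson), which is what SparkR could send 
instead of the regex-style type strings:

{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val asJson = schema.json                      // {"type":"struct","fields":[...]}
val roundTripped = DataType.fromJson(asJson)  // parses back into the same StructType
{code}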



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12154) Upgrade to Jersey 2

2015-12-04 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-12154:
--

 Summary: Upgrade to Jersey 2
 Key: SPARK-12154
 URL: https://issues.apache.org/jira/browse/SPARK-12154
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 1.5.2
Reporter: Matt Cheah


Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. 
Library conflicts for Jersey are difficult to work around - see the discussion on 
SPARK-11081. It's easier to upgrade Jersey entirely, but we should target Spark 
2.0 since this may be a breaking change for users who were using Jersey 1 in their 
Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042541#comment-15042541
 ] 

Xiao Li commented on SPARK-12138:
-

Sure. Will try it tonight or tomorrow. Thanks! : )

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration

2015-12-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042598#comment-15042598
 ] 

Saisai Shao commented on SPARK-10123:
-

Just confirming whether this is already on your plan, to avoid duplicated work :).

> Cannot set "--deploy-mode" in default configuration
> ---
>
> Key: SPARK-10123
> URL: https://issues.apache.org/jira/browse/SPARK-10123
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's no configuration option that is the equivalent of "--deploy-mode". So 
> it's not possible, for example, to have applications be submitted in 
> standalone cluster mode by default - you have to always use the command line 
> argument for that.
> YARN is special because it has the (somewhat deprecated) "yarn-cluster" 
> master, but it would be nice to be consistent and have a proper config option 
> for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042612#comment-15042612
 ] 

Apache Spark commented on SPARK-12138:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10155

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12138:


Assignee: Apache Spark

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12138) Escape \u in the generated comments.

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12138:


Assignee: (was: Apache Spark)

> Escape \u in the generated comments.
> 
>
> Key: SPARK-12138
> URL: https://issues.apache.org/jira/browse/SPARK-12138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/test-logs/12683942



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042618#comment-15042618
 ] 

Yin Huai commented on SPARK-12155:
--

We can take a look at task {{136}}.

{code}
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 196608 bytes free 
memory.269746176 bytes pool size and 269549568 bytes used 
memory.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8195014656, storageMemoryPool.poolSize 16659775488, storageRegionSize 
8464760832.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.(taskAttemptId: 136)
...
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 65536 bytes of memory 
from storage memory pool.Adding them back to 
onHeapExecutionMemoryPool.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size 
is 269811712 bytes. 262144 bytes are free.(taskAttemptId: 136)
15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 81920, poolSize 
269811712, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 
136.
{code}

Task 136 wants to acquire 262144 bytes. However, the execution memory pool only 
has {{196608}} bytes of free memory, so we decide to reclaim memory from the storage 
pool, and the amount we ask for is {{65536}} bytes. 

Then, we get those {{65536}} bytes and add them to the execution pool. Now, we have 
{{269811712}} bytes in the execution memory pool (it was {{269746176}}). However, 
in {{ExecutionMemoryPool.acquireMemory}}, we calculate {{maxToGrant}} using 
{{val maxToGrant = math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - 
curMem))}}. So the memory that task 136 is allowed to use is capped at {{269811712 
/ 4 = 67452928}} bytes (67371008 bytes are already used), and the grant for this 
request ends up being only {{67452928 - 67371008 = 81920}} bytes.
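Plugging task 136's numbers into that formula confirms the grant (a quick sanity check 
using the values from the log above):

{code}
val numBytes = 262144L
val poolSize = 269811712L
val numActiveTasks = 4
val curMem = 67371008L

// 269811712 / 4 = 67452928; 67452928 - 67371008 = 81920; min(262144, 81920) = 81920,
// which matches the "maxToGrant 81920" log line.
val maxToGrant = math.min(numBytes, math.max(0L, poolSize / numActiveTasks - curMem))
{code}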

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe, and when I 
> started to run queries against it, I got the following exception (I added 
> more logs to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 

[jira] [Created] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Yin Huai (JIRA)
Yin Huai created SPARK-12155:


 Summary: Execution OOM after a relative large dataset cached in 
the cluster.
 Key: SPARK-12155
 URL: https://issues.apache.org/jira/browse/SPARK-12155
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Reporter: Yin Huai
Priority: Blocker


I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe, and when I 
started to run queries against it, I got the following exception (I added more 
logs to the code).

{code}
15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 4 
cores with 16929521664 maxMemory, 8464760832 storageRegionSize.


15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
block rdd_94_37(free: 3253659951, max: 16798973952)
15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
block rdd_94_37(free: 3252611375, max: 16798973952)
15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
3028 bytes result sent to driver
15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
block rdd_94_37(free: 3314840375, max: 16866344960)
15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
block rdd_94_37(free: 3215892137, max: 16866344960)
15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space for 
block rdd_94_37(free: 3117216424, max: 16866344960)
15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space for 
block rdd_94_37(free: 2919868859, max: 16866344960)
15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space for 
block rdd_94_37(free: 2687050010, max: 16929521664)
15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
3028 bytes result sent to driver
15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space for 
block rdd_94_37(free: 2292321531, max: 16929521664)
15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space for 
block rdd_94_37(free: 1701062715, max: 16929521664)
15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space for 
block rdd_94_37(free: 799417533, max: 16929521664)
15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
require dropping another block from the same RDD
15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
memory! (computed 2.4 GB so far)
15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
(scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 0 bytes free memory.
15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
8464760832.
15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.
15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
space from StorageMemoryPool.
15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
memory. But, on-heap execution memory poll only has 0 bytes free memory.
15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
8464760832.
15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.
15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
space from StorageMemoryPool.
15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of memory 
from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
3077 bytes result sent to driver
15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0 (TID 128)
15/12/05 01:20:56 INFO Executor: Running task 13.0 in stage 5.0 (TID 132)
15/12/05 01:20:56 INFO Executor: Running task 5.0 in stage 5.0 (TID 124)
15/12/05 01:20:56 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing 
cache
15/12/05 01:20:56 INFO TorrentBroadcast: Started reading broadcast variable 6
15/12/05 01:20:56 INFO MemoryStore: Ensuring 9471 bytes of free 

[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042616#comment-15042616
 ] 

Yin Huai commented on SPARK-12155:
--

I have some new logs.
{code}
15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes
15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 0 bytes free 
memory.269484032 bytes pool size and 269484032 bytes used 
memory.(taskAttemptId: 131)
15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes
15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes
15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8195276800, storageMemoryPool.poolSize 16660037632, storageRegionSize 
8464760832.(taskAttemptId: 131)
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.(taskAttemptId: 131)
15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
space from StorageMemoryPool.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
from storage memory pool.Adding them back to 
onHeapExecutionMemoryPool.(taskAttemptId: 131)
15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size 
is 269746176 bytes. 262144 bytes are free.(taskAttemptId: 131)
15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 65536, poolSize 
269746176, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 
131.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 196608 bytes free 
memory.269746176 bytes pool size and 269549568 bytes used 
memory.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8195014656, storageMemoryPool.poolSize 16659775488, storageRegionSize 
8464760832.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.(taskAttemptId: 136)
15/12/05 02:51:33 INFO TaskMemoryManager: Task 131 acquire 64.0 KB for 
org.apache.spark.unsafe.map.BytesToBytesMap@118eb11
15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 65536 bytes free memory 
space from StorageMemoryPool.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 65536 bytes of memory 
from storage memory pool.Adding them back to 
onHeapExecutionMemoryPool.(taskAttemptId: 136)
15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size 
is 269811712 bytes. 262144 bytes are free.(taskAttemptId: 136)
15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 81920, poolSize 
269811712, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 
136.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 180224 bytes free 
memory.269811712 bytes pool size and 269631488 bytes used 
memory.(taskAttemptId: 135)
15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8194949120, storageMemoryPool.poolSize 16659709952, storageRegionSize 
8464760832.(taskAttemptId: 135)
15/12/05 02:51:33 INFO TaskMemoryManager: Task 136 acquire 80.0 KB for 
org.apache.spark.unsafe.map.BytesToBytesMap@2a342d48
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.(taskAttemptId: 135)
15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 81920 bytes free memory 
space from StorageMemoryPool.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 81920 bytes of memory 
from storage memory pool.Adding them back to 
onHeapExecutionMemoryPool.(taskAttemptId: 135)
15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size 
is 269893632 bytes. 262144 bytes are free.(taskAttemptId: 135)
15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 102400, poolSize 
269893632, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 
135.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
memory. But, on-heap execution memory poll only has 159744 bytes free 
memory.269893632 bytes pool size and 269733888 bytes used 
memory.(taskAttemptId: 119)
15/12/05 02:51:33 INFO TaskMemoryManager: Task 135 acquire 100.0 KB for 
org.apache.spark.unsafe.map.BytesToBytesMap@74e81f25
15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
8194867200, storageMemoryPool.poolSize 16659628032, storageRegionSize 
8464760832.(taskAttemptId: 119)
15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from 
storage memory pool.(taskAttemptId: 119)
15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 102400 bytes free memory 
space from StorageMemoryPool.
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 102400 bytes of memory 
from storage memory pool.Adding 

[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042582#comment-15042582
 ] 

Apache Spark commented on SPARK-12155:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10153

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe, and when I 
> started to run queries against it, I got the following exception (I added 
> more logs to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got 

[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12155:


Assignee: (was: Apache Spark)

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe, and when I 
> started to run queries against it, I got the following exception (I added 
> more logs to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
> 15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0 (TID 128)
> 

[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12155:


Assignee: Apache Spark

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe, and when I 
> started to run queries against it, I got the following exception (I added 
> more logs to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
> 15/12/05 01:20:56 INFO Executor: Running task 9.0 in 

[jira] [Commented] (SPARK-12149) Executor UI improvement suggestions - Color UI

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042605#comment-15042605
 ] 

Apache Spark commented on SPARK-12149:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/10154

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alexander Bozarth
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed));
> if dark blue then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time
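A hypothetical Scala sketch of the proposed color mapping (illustrative only, not the 
code in the linked pull request): red intensity grows with the number of failed tasks, 
and GC time turns red once it exceeds a fraction of total task time:

{code}
// Returns a CSS style string for a table cell; an empty string means "no highlight".
def failedTasksStyle(failed: Int): String =
  if (failed == 0) "" else {
    val alpha = math.min(1.0, 0.3 + 0.1 * math.log1p(failed.toDouble))
    f"background-color: rgba(255, 0, 0, $alpha%.2f)"
  }

def gcTimeStyle(gcTimeMs: Long, taskTimeMs: Long, threshold: Double = 0.1): String =
  if (taskTimeMs > 0 && gcTimeMs.toDouble / taskTimeMs > threshold) "color: red" else ""
{code}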



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12149) Executor UI improvement suggestions - Color UI

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12149:


Assignee: Apache Spark

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alexander Bozarth
>Assignee: Apache Spark
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed));
> if dark blue then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12149) Executor UI improvement suggestions - Color UI

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12149:


Assignee: (was: Apache Spark)

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alexander Bozarth
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed));
> if dark blue then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2015-12-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042681#comment-15042681
 ] 

Jean-Baptiste Onofré commented on SPARK-12140:
--

I just wanted to check whether you had already started something ;)

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12102:


Assignee: (was: Apache Spark)

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 
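One possible direction, sketched below purely as an illustration (the helper name and the non-recursive handling are assumptions, not the actual analyzer change), is to treat two struct types that agree on field names and data types but differ only in nullability as coercible, by widening each field to nullable:

{code}
import org.apache.spark.sql.types._

// Hypothetical, non-recursive sketch: if two struct types agree on field names
// and data types and differ only in nullability, build a common type whose
// fields are nullable whenever either side's field is nullable.
def widenNullability(a: StructType, b: StructType): Option[StructType] = {
  if (a.length != b.length) {
    None
  } else {
    val widened = a.fields.zip(b.fields).map { case (fa, fb) =>
      if (fa.name == fb.name && fa.dataType == fb.dataType) {
        Some(fa.copy(nullable = fa.nullable || fb.nullable))
      } else {
        None
      }
    }
    if (widened.forall(_.isDefined)) Some(StructType(widened.map(_.get))) else None
  }
}
{code}

The real fix would presumably live in the type-coercion handling for CASE WHEN branches, but the widening idea is the same.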



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042717#comment-15042717
 ] 

Apache Spark commented on SPARK-12102:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/10156

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12102:


Assignee: Apache Spark

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12149) Executor UI improvement suggestions - Color UI

2015-12-04 Thread Alexander Bozarth (JIRA)
Alexander Bozarth created SPARK-12149:
-

 Summary: Executor UI improvement suggestions - Color UI
 Key: SPARK-12149
 URL: https://issues.apache.org/jira/browse/SPARK-12149
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Alexander Bozarth


Splitting off the Color UI portion of the parent UI improvements task, 
description copied below:

Fill some of the cells with color in order to make it easier to absorb the 
info, e.g.
RED if Failed Tasks greater than 0 (maybe the more failed, the more intense the 
red)
GREEN if Active Tasks greater than 0 (maybe more intense the larger the number)
Possibly color code COMPLETE TASKS using various shades of blue (e.g., based on 
the log(# completed))
if dark blue then write the value in white (same for the RED and GREEN above)

Merging another idea from SPARK-2132: 
Color GC time red when over a percentage of task time
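As a rough sketch of the intensity idea (the helper and thresholds below are made up for illustration, not existing Spark UI code), the cell style could be derived from the count along these lines:

{code}
// Hypothetical helper, only to illustrate the proposal: map a failed-task count
// to a red shade whose intensity grows with log(count), and switch the text to
// white once the background gets dark enough to hurt readability.
def failedTasksStyle(failedTasks: Int): String = {
  if (failedTasks <= 0) {
    ""  // leave the cell uncolored
  } else {
    // saturate around 1000 failures so the shade stays in a sane range
    val intensity = math.min(1.0, math.log1p(failedTasks) / math.log(1000.0))
    val shade = (255 - intensity * 155).toInt   // 255 (pale red) down to 100 (dark red)
    val textColor = if (shade < 160) "white" else "black"
    s"background-color: rgb(255, $shade, $shade); color: $textColor;"
  }
}
{code}

The same shape of function would cover the green (active tasks) and blue (log of completed tasks) cases, and the GC-time rule from SPARK-2132 is just a ratio check against task time.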



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10509) Excessive param boiler plate code

2015-12-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041937#comment-15041937
 ] 

holdenk commented on SPARK-10509:
-

So I'm thinking a pretty simple option would be just to have a 'reset owner' or 
similar function (we could have it work only if the owner is the dummy one).

> Excessive param boiler plate code
> -
>
> Key: SPARK-10509
> URL: https://issues.apache.org/jira/browse/SPARK-10509
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a lot of dummy params that we re-set in the init code, which results 
> in a bunch of duplicated code. We should fix this at some point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1832) Executor UI improvement suggestions

2015-12-04 Thread Alexander Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041936#comment-15041936
 ] 

Alexander Bozarth commented on SPARK-1832:
--

I've split off a subtask for the Color improvements that I will be submitting a 
PR for shortly.

Also what do you mean by "the MASTER task" in the description?

> Executor UI improvement suggestions
> ---
>
> Key: SPARK-1832
> URL: https://issues.apache.org/jira/browse/SPARK-1832
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
>  Fill some of the cells with color in order to make it easier to absorb 
> the info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed))
> - if dark blue then write the value in white (same for the RED and GREEN above)
> Maybe mark the MASTER task somehow
>  
> Report the TOTALS in each column (do this at the TOP so no need to scroll 
> to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-04 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041954#comment-15041954
 ] 

kevin yu commented on SPARK-12128:
--

Hello Philip: Thanks for reporting this problem; it looks like a bug to me. I 
can recreate it as well. Are you planning to fix it? If not, I can look into 
the code. Thanks.

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or as SQL), and I'm assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is that if you run it you will see that the addition, division, 
> and subtraction work, but the multiplication returns null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +------------------+
> |(price * quantity)|
> +------------------+
> |              null|
> |              null|
> |              null|
> |              null|
> |              null|
> +------------------+
> +--------------------+----+--------+--------+
> |                 _c0| _c1|     _c2|     _c3|
> +--------------------+----+--------+--------+
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|    8.00|
> |1.272727272727272727|null|50.00...|    6.00|
> |                0.83|null|44.00...|-4.00...|
> |                1.00|null|58.00...|   0E-18|
> +--------------------+----+--------+--------+
> {code}
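For what it's worth, the behaviour looks consistent with a precision overflow: the Scala {{Decimal}} fields map to the widest decimal type (38, 18), and the product of two such columns requires more precision than can be represented, so the result comes back null, while +, - and / happen to stay in range. A hedged workaround sketch, assuming the {{trades}} DataFrame from the snippet above (the chosen precision (19, 4) is arbitrary), is to narrow the operands before multiplying:

{code}
// Hedged workaround sketch: narrow the operands so the product's precision
// still fits within a representable decimal type. (19, 4) is an arbitrary pick.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val product = trades.select(
  (col("price").cast(DecimalType(19, 4)) * col("quantity").cast(DecimalType(19, 4)))
    .as("price_times_quantity"))
product.show()
{code}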



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12089:
-
Fix Version/s: (was: 2.0.0)

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Blocker
> Fix For: 1.6.0
>
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2G worth of data, and will likely hold even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 
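A minimal sketch of an overflow-safe growth calculation, assuming nothing about the surrounding BufferHolder code beyond the line quoted above (the names below are made up for illustration):

{code}
// Hypothetical sketch: double the buffer, but do the arithmetic in Long so the
// doubling cannot wrap negative, and fail loudly at the JVM's practical
// array-size ceiling instead of throwing NegativeArraySizeException.
val MaxArraySize: Int = Int.MaxValue - 8

def grownLength(currentLength: Int, neededExtra: Int): Int = {
  val required = currentLength.toLong + neededExtra.toLong
  val doubled  = currentLength.toLong * 2L
  val target   = math.max(required, doubled)
  if (target > MaxArraySize) {
    throw new UnsupportedOperationException(
      s"Cannot grow buffer beyond $MaxArraySize bytes (requested $target)")
  }
  target.toInt
}
{code}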



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


