[jira] [Created] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
Ferdinand Xu created SPARK-12145: Summary: Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization Key: SPARK-12145 URL: https://issues.apache.org/jira/browse/SPARK-12145 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Ferdinand Xu "Set role [ADMIN|NONE|ALL]" is not working since it's treated as a "SET key=value" command. Also, the user information required to initialize authorization is missing when sessions are created. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
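The mis-parsing described above can be illustrated with a small sketch. This is a hypothetical dispatcher, not Spark's actual parser: a rule that greedily treats every statement beginning with SET as a "SET key=value" assignment swallows "SET ROLE ADMIN", so role statements must be matched first.

```python
import re

def dispatch(statement):
    """Classify a SQL statement; a sketch of the ordering fix, not Spark's parser."""
    s = statement.strip()
    # Match "SET ROLE <role>" before the generic SET fallback; otherwise the
    # statement is mis-parsed as a "SET key=value" configuration command.
    m = re.match(r"(?i)^set\s+role\s+(\S+)$", s)
    if m:
        return ("SET_ROLE", m.group(1).upper())
    m = re.match(r"(?i)^set\s+(\S+?)\s*=\s*(.*)$", s)
    if m:
        return ("SET_CONF", m.group(1), m.group(2))
    return ("OTHER", s)
```

With the role rule checked first, "SET ROLE ADMIN" is classified as a role command while "SET spark.sql.shuffle.partitions=10" still falls through to the config path.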
[jira] [Commented] (SPARK-9838) Support Poisson family in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041258#comment-15041258 ] Xusen Yin commented on SPARK-9838: -- Hi [~mengxr], can I work on this? > Support Poisson family in SparkR:::glm > -- > > Key: SPARK-9838 > URL: https://issues.apache.org/jira/browse/SPARK-9838 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Support Poisson family in SparkR::glm(). This task might need further > refinements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration
[ https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041260#comment-15041260 ] Saisai Shao commented on SPARK-10123: - Hi [~vanzin], would you mind letting me take a crack on this? > Cannot set "--deploy-mode" in default configuration > --- > > Key: SPARK-10123 > URL: https://issues.apache.org/jira/browse/SPARK-10123 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcelo Vanzin >Priority: Minor > > There's no configuration option that is the equivalent of "--deploy-mode". So > it's not possible, for example, to have applications be submitted in > standalone cluster mode by default - you have to always use the command line > argument for that. > YARN is special because it has the (somewhat deprecated) "yarn-cluster" > master, but it would be nice to be consistent and have a proper config option > for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
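The improvement asked for above amounts to the usual precedence chain: command-line flag wins, then a configuration default, then the built-in default. A minimal sketch, assuming a config property named `spark.submit.deployMode` (the property name is an assumption for illustration, not something the report specifies):

```python
def resolve_deploy_mode(cli_value, conf):
    """Pick the deploy mode with CLI-over-config-over-default precedence.
    The property name 'spark.submit.deployMode' is assumed for this sketch."""
    mode = cli_value or conf.get("spark.submit.deployMode") or "client"
    if mode not in ("client", "cluster"):
        raise ValueError("deploy mode must be 'client' or 'cluster': %r" % mode)
    return mode
```

With such a property, a standalone application could default to cluster mode via spark-defaults.conf while `--deploy-mode client` on the command line would still override it.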
[jira] [Assigned] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
[ https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12145: Assignee: (was: Apache Spark) > Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization > --- > > Key: SPARK-12145 > URL: https://issues.apache.org/jira/browse/SPARK-12145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ferdinand Xu > > "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET > key=value" command. Also the user information is missing when creating > sessions which is required to initialize the authorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
[ https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041254#comment-15041254 ] Apache Spark commented on SPARK-12145: -- User 'winningsix' has created a pull request for this issue: https://github.com/apache/spark/pull/10144 > Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization > --- > > Key: SPARK-12145 > URL: https://issues.apache.org/jira/browse/SPARK-12145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ferdinand Xu > > "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET > key=value" command. Also the user information is missing when creating > sessions which is required to initialize the authorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12145) Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization
[ https://issues.apache.org/jira/browse/SPARK-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12145: Assignee: Apache Spark > Command "Set Role [ADMIN|NONE|ALL]" doesn't work in SQL based authorization > --- > > Key: SPARK-12145 > URL: https://issues.apache.org/jira/browse/SPARK-12145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ferdinand Xu >Assignee: Apache Spark > > "Set role [ADMIN|NONE|ALL]" is not working since it's treated as "SET > key=value" command. Also the user information is missing when creating > sessions which is required to initialize the authorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12146) SparkR jsonFile should support multiple input files
Yanbo Liang created SPARK-12146: --- Summary: SparkR jsonFile should support multiple input files Key: SPARK-12146 URL: https://issues.apache.org/jira/browse/SPARK-12146 Project: Spark Issue Type: Bug Components: SparkR Reporter: Yanbo Liang SparkR jsonFile should support multiple input files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041301#comment-15041301 ] Saisai Shao commented on SPARK-12103: - I think I had a proposal of a message handler (receiver interceptor) for the receiver-based Kafka stream to support your scenario, but it was obsolete for a long time and only adopted in the direct Kafka stream. Currently, adding this feature back to the receiver-based Kafka stream would break a lot of things, and it is also not very meaningful (I think lots of users are shifting to the direct API now). > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left key is always null, making it > impossible to know which topic the data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned : DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > }) > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
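Until the receiver fills in the topic key, one workaround is to create one stream per topic and tag each record with its topic before the union. The sketch below simulates that tagging in plain Python (the Spark calls are elided; `records_by_topic` is a stand-in for the per-topic streams):

```python
def tag_and_union(records_by_topic):
    """Simulate tagging each per-topic stream with its topic name before a
    union, recovering the (topic, value) pairs that createStream leaves as
    (None, value)."""
    tagged = []
    for topic, records in records_by_topic.items():
        # In Spark this would be stream.map(lambda v: (topic, v)) applied to
        # each per-topic DStream before ssc.union(...); here we merge lists.
        tagged.extend((topic, value) for value in records)
    return tagged
```

The cost is one receiver per topic rather than one pooled receiver, which is exactly the resource trade-off the report is trying to avoid; the direct API sidesteps the problem entirely.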
[jira] [Updated] (SPARK-12146) SparkR jsonFile should support multiple input files
[ https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12146: Description: This bug is easy to reproduce, jsonFile did not support character vector as arguments. {code} > path <- c("/path/to/dir1","/path/to/dir2") > raw.terror<-jsonFile(sqlContext,path) 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2 {code} was: > path <- c("/path/to/dir1","/path/to/dir2") > raw.terror<-jsonFile(sqlContext,path) 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) 
: java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2 > SparkR jsonFile should support multiple input files > --- > > Key: SPARK-12146 > URL: https://issues.apache.org/jira/browse/SPARK-12146 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang > > This bug is easy to reproduce, jsonFile did not support character vector as > arguments. > {code} > > path <- c("/path/to/dir1","/path/to/dir2") > > raw.terror<-jsonFile(sqlContext,path) > 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) 
: > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
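Until jsonFile accepts a character vector, a common workaround is to collapse the paths into a single comma-separated string, since Hadoop's FileInputFormat-based readers treat commas as path separators. A small sketch (in Python for illustration; the SparkR call would take the joined string):

```python
def join_input_paths(paths):
    """Collapse a list of input directories into the single comma-separated
    string accepted by Hadoop FileInputFormat-based readers."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    if not cleaned:
        # Mirrors the "No input paths specified in job" failure mode above.
        raise ValueError("no input paths specified")
    return ",".join(cleaned)
```

The proper fix, of course, is for jsonFile itself to accept a vector of paths, which is what this issue tracks.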
[jira] [Commented] (SPARK-12137) Spark Streaming State Recovery limitations
[ https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041500#comment-15041500 ] Sean Owen commented on SPARK-12137: --- The checkpoint mechanism works for recovering after an app fails and is restarted. There's no way to guarantee it works for an arbitrarily different new version of your app. It may work if you've designed the state you serialize appropriately. This is no different from the issue of deserializing objects serialized from an old version of POJOs, for instance. What's the issue here -- are you just asking the question? I believe this is the answer. > Spark Streaming State Recovery limitations > -- > > Key: SPARK-12137 > URL: https://issues.apache.org/jira/browse/SPARK-12137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Ravindar >Priority: Critical > > There were multiple threads in the forums asking a similar question without a > clear answer, hence entering it here. > We have a streaming application that goes through multi-step processing. In > some of these steps stateful operations like *updateStateByKey* are used to > maintain an accumulated running state (and other state info) with incoming > RDD streams. As the streaming application is incremental, it is imperative that > we recover/restore from the previous known state in the following two scenarios: > 1. On spark driver/streaming application failure. > In this scenario the driver/streaming application is shut down and > restarted. The recommended approach is to enable *checkpoint(checkpointDir)* > and use *StreamingContext.getOrCreate* to restore the context from the checkpoint > state. > 2. Upgrade of the driver/streaming application with additional steps in the > processing. > In this scenario, we introduced new steps with downstream processing for > new functionality without changes to existing steps. 
Upgrading the streaming > application with the new code fails on *StreamingContext.getOrCreate* as there is > a mismatch with the saved checkpoint. > Both of the above scenarios need a unified approach where accumulated state > has to be saved and restored. The first approach of restoring from a checkpoint > works for driver failure but not for a code upgrade. When the application code > changes, the recommendation is to delete the checkpoint data when the new code is > deployed. If so, how do you reconstitute all of the stateful (e.g. > *updateStateByKey*) information from the last run? Every streaming application > has to save up-to-date state for each session represented by a key and then > initialize it from this when a new session starts for the same key. Does > every application have to create its own mechanism, given this is very > similar to the current state checkpointing to HDFS? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
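One pattern that survives code upgrades, hinted at in the discussion above, is to snapshot the accumulated state to an external store each batch (e.g. via foreachRDD) and seed the initial state of the new version from that snapshot instead of relying on the checkpoint. A minimal sketch of such a snapshot store, assuming JSON-serializable per-session state (the class and file layout are hypothetical):

```python
import json
import os
import tempfile

class StateSnapshotStore:
    """Persist per-key streaming state outside Spark's checkpoint so that a
    new version of the application can seed its initial state after an
    upgrade, rather than failing on a checkpoint mismatch."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        # Write atomically so a crash mid-write never leaves a torn snapshot.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

In a real job the save() call would run in a foreachRDD action against the updateStateByKey output, and load() would supply the initial state RDD on restart; HDFS or a key-value store would replace the local file.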
[jira] [Commented] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark
[ https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041497#comment-15041497 ] Yuval Tanny commented on SPARK-11868: - No. I think there should be a decision on the desired behaviour beforehand. I provided details here that I found when trying to understand how to reproduce the problem I've encountered. > wrong results returned from dataframe created from Rows without consistent > schema on pyspark > -- > > Key: SPARK-11868 > URL: https://issues.apache.org/jira/browse/SPARK-11868 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2 > Environment: pyspark >Reporter: Yuval Tanny > > When the schema is inconsistent (but is the same for the first 10 rows), it's > possible to create a dataframe from dictionaries, and if a key is missing, its > value is None. But when trying to create a dataframe from the corresponding rows, > we get inconsistent behavior (wrong values for keys) without an exception. See > the example below. > The problems seem to be: > 1. Not verifying all rows against the schema. > 2. In pyspark.sql.types._create_converter, None is set when converting a > dictionary and a field does not exist: > {code} > return tuple([conv(d.get(name)) for name, conv in zip(names, converters)]) > {code} > But for Rows, it is just assumed that the number of fields in the tuple equals > the number of fields in the inferred schema, and values are placed under the wrong > keys otherwise: > {code} > return tuple(conv(v) for v, conv in zip(obj, converters)) > {code} > Thanks. 
> example: > {code} > dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}] > rows = [pyspark.sql.Row(**r) for r in dicts] > rows_rdd = sc.parallelize(rows) > dicts_rdd = sc.parallelize(dicts) > rows_df = sqlContext.createDataFrame(rows_rdd) > dicts_df = sqlContext.createDataFrame(dicts_rdd) > print(rows_df.select(['2']).collect()[10]) > print(dicts_df.select(['2']).collect()[10]) > {code} > output: > {code} > Row(2=3) > Row(2=None) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
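The misalignment in the quoted converters can be reproduced in plain Python: zipping a short Row tuple against the inferred field names silently shifts later values under the wrong names, while the dict path fills the missing key with None.

```python
def convert_row(values, names):
    """Mimic the tuple branch quoted above: a positional zip, so a missing
    field shifts every later value under the wrong name."""
    return dict(zip(names, values))

def convert_dict(d, names):
    """Mimic the dict branch quoted above: lookup by name, so a missing
    field simply becomes None."""
    return {name: d.get(name) for name in names}
```

With names ['1', '2', '3'] and a row missing field '2', the tuple path yields {'1': 1, '2': 3} (and drops '3'), matching the Row(2=3) output in the report, while the dict path correctly yields {'1': 1, '2': None, '3': 3}.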
[jira] [Updated] (SPARK-12146) SparkR jsonFile should support multiple input files
[ https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12146: Description: > path <- c("/path/to/dir1","/path/to/dir2") > raw.terror<-jsonFile(sqlContext,path) 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2 was:SparkR jsonFile should support multiple input files. > SparkR jsonFile should support multiple input files > --- > > Key: SPARK-12146 > URL: https://issues.apache.org/jira/browse/SPARK-12146 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang > > > path <- c("/path/to/dir1","/path/to/dir2") > > raw.terror<-jsonFile(sqlContext,path) > 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) 
: > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12122) Recovered streaming context can sometimes run a batch twice
[ https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12122. --- Resolution: Fixed Fix Version/s: 1.6.0 > Recovered streaming context can sometimes run a batch twice > --- > > Key: SPARK-12122 > URL: https://issues.apache.org/jira/browse/SPARK-12122 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.6.0 > > > After recovering from checkpoint, the JobGenerator figures out which batches > to run again. That can sometimes lead to a batch being submitted twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12146) SparkR jsonFile should support multiple input files
[ https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12146: Assignee: (was: Apache Spark) > SparkR jsonFile should support multiple input files > --- > > Key: SPARK-12146 > URL: https://issues.apache.org/jira/browse/SPARK-12146 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang > > SparkR jsonFile should support multiple input files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12146) SparkR jsonFile should support multiple input files
[ https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041320#comment-15041320 ] Apache Spark commented on SPARK-12146: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10145 > SparkR jsonFile should support multiple input files > --- > > Key: SPARK-12146 > URL: https://issues.apache.org/jira/browse/SPARK-12146 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang > > SparkR jsonFile should support multiple input files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12146) SparkR jsonFile should support multiple input files
[ https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12146: Assignee: Apache Spark > SparkR jsonFile should support multiple input files > --- > > Key: SPARK-12146 > URL: https://issues.apache.org/jira/browse/SPARK-12146 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > SparkR jsonFile should support multiple input files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12147) Off heap storage and dynamicAllocation operation
Rares Mirica created SPARK-12147: Summary: Off heap storage and dynamicAllocation operation Key: SPARK-12147 URL: https://issues.apache.org/jira/browse/SPARK-12147 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 Tachyon 0.7.1 Yarn Reporter: Rares Mirica For the purpose of increasing computation density and efficiency I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled. Following the available documentation (programming-guide for Spark 1.5.2) I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to also include Graceful Decommission. Furthermore, the GD description in the job-scheduling document also hints at cached-data preservation through off-heap storage. Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase). Needless to say, enabling the preservation of off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array
[ https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041626#comment-15041626 ] Jorge Sanchez commented on SPARK-2489: -- This functionality would be very helpful. Is there any way we can help to fix it? > Unsupported parquet datatype optional fixed_len_byte_array > -- > > Key: SPARK-2489 > URL: https://issues.apache.org/jira/browse/SPARK-2489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Pei-Lun Lee > > tested against commit 9fe693b5 > {noformat} > scala> sqlContext.parquetFile("/tmp/foo") > java.lang.RuntimeException: Unsupported parquet datatype optional > fixed_len_byte_array(4) b > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279) > {noformat} > example avro schema > {noformat} > protocol Test { > fixed Bytes4(4); > record Foo { > union {null, Bytes4} b; > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
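The failure above comes from a primitive-type conversion that has no case for FIXED_LEN_BYTE_ARRAY; mapping it to a binary type would be one plausible fix. A hypothetical sketch of that lookup (the type names and table are illustrative, not Spark's actual converter):

```python
# Illustrative mapping from Parquet primitive type names to engine type names;
# these names are assumptions for the sketch, not Spark's real converter.
_PRIMITIVE_TYPES = {
    "int32": "IntegerType",
    "int64": "LongType",
    "boolean": "BooleanType",
    "double": "DoubleType",
    "binary": "StringType",
    "fixed_len_byte_array": "BinaryType",  # the missing case from this report
}

def to_primitive_data_type(parquet_type):
    # Strip a length annotation such as "fixed_len_byte_array(4)".
    base = parquet_type.split("(", 1)[0]
    try:
        return _PRIMITIVE_TYPES[base]
    except KeyError:
        # Mirrors the "Unsupported parquet datatype" error quoted above.
        raise RuntimeError("Unsupported parquet datatype %s" % parquet_type)
```

The point is only that the length annotation must be stripped (or parsed) before the lookup, and that fixed-length byte arrays degrade naturally to a plain binary type.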
[jira] [Commented] (SPARK-11707) StreamCorruptedException if authentication is enabled
[ https://issues.apache.org/jira/browse/SPARK-11707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041623#comment-15041623 ] Sean Owen commented on SPARK-11707: --- [~jlewandowski] any chance you can try this with the latest code? > StreamCorruptedException if authentication is enabled > - > > Key: SPARK-11707 > URL: https://issues.apache.org/jira/browse/SPARK-11707 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Jacek Lewandowski > > When authentication (and encryption) is enabled (at least in standalone > mode), the following code (in Spark shell): > {code} > sc.makeRDD(1 to 10, 10).map(x => x*x).map(_.toString).reduce(_ + _) > {code} > finishes with exception: > {noformat} > [Stage 0:> (0 + 8) / > 10]15/11/12 20:36:29 ERROR TransportRequestHandler: Error while invoking > RpcHandler#receive() on RPC id 5750598674048943239 > java.io.StreamCorruptedException: invalid type code: 30 > at > java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2508) > at > java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2543) > at > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2702) > at java.io.ObjectInputStream.read(ObjectInputStream.java:865) > at > java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) > at > org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:38) > at > org.apache.spark.util.SerializableBuffer$$anonfun$readObject$1.apply(SerializableBuffer.scala:32) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1186) > at > org.apache.spark.util.SerializableBuffer.readObject(SerializableBuffer.scala:32) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:248) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:296) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:247) > at > org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:448) > at > org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:76) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:122) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:94) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at >
[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12147: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) (Make the title more specific?) I disagree it's a 'bug'. Cached data is not lost when an executor is lost, in the sense that it was just a cached copy. The cached copy is lost, but can be recreated. Tachyon is not required, and doesn't look like it's going to be, so I'm not sure it's a general solution to this. Your suggestion requires doubling the amount of storage for cached data: now things live in memory or on local disk and also in Tachyon or something. Right? It's also not true that after executors are decommissioned, others can keep using the cached data. There aren't enough executors to keep the cached data live any more. I think this has problems but maybe you can clarify what you mean. > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica >Priority: Minor > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. > Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. 
> Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rares Mirica updated SPARK-12147: - Attachment: spark-defaults.conf > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. > Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. > Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. 
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041679#comment-15041679 ] Rares Mirica commented on SPARK-12147: -- Sorry, I wasn't specific enough about the use-case and how to trigger/take advantage of this. There is no need to cache data in the traditional sense (by calling .cache() on the RDD) so no on-heap space is required. One only needs to append .persist(OFF_HEAP) after the computation to take advantage of this. All of the data should therefore reside in OFF-HEAP storage (for the time being this is Tachyon). There is no alternative off-heap implementation, so Tachyon is required to take advantage of this; the only alternative would be to serialise the result of the expensive computation to disk (through a .saveX call) and then re-load the RDD through sparkContext.textFile (or equivalent, using Parquet or Java serialised objects). The data should only live in one place, Tachyon, and should be considered persisted (as it would be through serialising and saving to HDFS) for the lifetime of the application. If this were the case, the death or decommission of an executor would be completely decoupled from the data originating in that executor and "cached" in Tachyon. > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica >Priority: Minor > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. 
This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. > Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. > Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large
[ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12026: Assignee: Apache Spark > ChiSqTest gets slower and slower over time when number of features is large > --- > > Key: SPARK-12026 > URL: https://issues.apache.org/jira/browse/SPARK-12026 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Hunter Kelly >Assignee: Apache Spark > Labels: mllib, stats > Attachments: First Stages.png, Latest Stages.png > > > I've been running a ChiSqTest to pick features for feature reduction. My > understanding is that internally it creates jobs to run on batches of 1000 > features at a time. > I was under the impression that the features are treated as independant, but > this does not appear to be the case. When the number of features is large > (160k in my case), each batch gets slower and slower. As an example, running > on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch. > By the end, batches were taking over 30 minutes per batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
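The reported timings can be checked against a simple accumulating-cost model. This is a hypothetical illustration only: the batch size of 1000, the 160k feature count, and the 1-minute/30-minute figures come from the report above, while the assumption that per-batch cost grows linearly with the number of batches already run is ours (it is one pattern consistent with the observed slowdown, e.g. state or lineage that accumulates across jobs):

```scala
object BatchSlowdownModel {
  def main(args: Array[String]): Unit = {
    // 160k features in batches of 1000 means 160 separate jobs.
    val numFeatures = 160000
    val batchSize   = 1000
    val numBatches  = numFeatures / batchSize
    println(numBatches) // 160

    // Hypothetical per-batch cost: 1 minute for the first batch, growing
    // linearly to 30 minutes for the last, matching the reported figures.
    val slope = 29.0 / (numBatches - 1)
    def costMinutes(i: Int): Double = 1.0 + i * slope

    val total = (0 until numBatches).map(costMinutes).sum
    println(f"$total%.0f") // about 2480 minutes, vs 160 if cost were flat
  }
}
```

Under this model total time is quadratic in the number of features rather than linear, which is why the job appears fine on small inputs and degrades badly at 160k features.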
[jira] [Assigned] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large
[ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12026: Assignee: (was: Apache Spark) > ChiSqTest gets slower and slower over time when number of features is large > --- > > Key: SPARK-12026 > URL: https://issues.apache.org/jira/browse/SPARK-12026 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Hunter Kelly > Labels: mllib, stats > Attachments: First Stages.png, Latest Stages.png > > > I've been running a ChiSqTest to pick features for feature reduction. My > understanding is that internally it creates jobs to run on batches of 1000 > features at a time. > I was under the impression that the features are treated as independant, but > this does not appear to be the case. When the number of features is large > (160k in my case), each batch gets slower and slower. As an example, running > on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch. > By the end, batches were taking over 30 minutes per batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7999) Graph complement function in GraphX
[ https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7999. -- Resolution: Won't Fix > Graph complement function in GraphX > --- > > Key: SPARK-7999 > URL: https://issues.apache.org/jira/browse/SPARK-7999 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Tarek Auel >Priority: Minor > > This task is for implementing the complement operation (compare to parent > task). > http://techieme.in/complex-graph-operations/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7894. -- Resolution: Won't Fix > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families.Vertexes and edges which are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The below image shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A Simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, inevitably vertexes and edges overlapping will happen between > borders of graphs. For vertex, it's quite nature to just make a union and > remove those duplicate ones. But for edges, a mergeEdges function seems to be > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7893. -- Resolution: Won't Fix > Complex Operators between Graphs > > > Key: SPARK-7893 > URL: https://issues.apache.org/jira/browse/SPARK-7893 > Project: Spark > Issue Type: Umbrella > Components: GraphX >Reporter: Andy Huang > Labels: complex, graph, join, operators, union > > Currently there are 30+ operators in GraphX, while few of them consider > operators between graphs. The only one is _*mask*_, which takes another graph > as a parameter and return a new graph. > In many complex case,such as _*streaming graph, small graph merge into huge > graph*_, higher level operators of graphs can help users to focus and think > in graph. Performance optimization can be done internally and be transparent > to them. > Complex graph operator list is > here:[complex_graph_operations|http://techieme.in/complex-graph-operations/] > * Union of Graphs ( G ∪ H ) > * Intersection of Graphs( G ∩ H) > * Graph Join > * Difference of Graphs(G – H) > * Graph Complement > * Line Graph ( L(G) ) > This issue will be index of all these operators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large
[ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041620#comment-15041620 ] Apache Spark commented on SPARK-12026: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/10146 > ChiSqTest gets slower and slower over time when number of features is large > --- > > Key: SPARK-12026 > URL: https://issues.apache.org/jira/browse/SPARK-12026 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Hunter Kelly > Labels: mllib, stats > Attachments: First Stages.png, Latest Stages.png > > > I've been running a ChiSqTest to pick features for feature reduction. My > understanding is that internally it creates jobs to run on batches of 1000 > features at a time. > I was under the impression that the features are treated as independant, but > this does not appear to be the case. When the number of features is large > (160k in my case), each batch gets slower and slower. As an example, running > on 25 m3.2xlarges on Amazon EMR, it started at just over 1 minute per batch. > By the end, batches were taking over 30 minutes per batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041789#comment-15041789 ] Rares Mirica commented on SPARK-12147: -- Yes, I am talking about the executor stopping as part of scaling down on dynamic allocation. I am observing this in an actual test; I was reading the docs just to test my assumption. > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica >Priority: Minor > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. > Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. > Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. 
After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041820#comment-15041820 ] Marcelo Vanzin commented on SPARK-12140: You don't need to ask permission to work on things. > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration
[ https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041817#comment-15041817 ] Marcelo Vanzin commented on SPARK-10123: Bug is not assigned to anyone, so no need to ask permission... > Cannot set "--deploy-mode" in default configuration > --- > > Key: SPARK-10123 > URL: https://issues.apache.org/jira/browse/SPARK-10123 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcelo Vanzin >Priority: Minor > > There's no configuration option that is the equivalent of "--deploy-mode". So > it's not possible, for example, to have applications be submitted in > standalone cluster mode by default - you have to always use the command line > argument for that. > YARN is special because it has the (somewhat deprecated) "yarn-cluster" > master, but it would be nice to be consistent and have a proper config option > for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742 ] Jiri Syrovy edited comment on SPARK-8513 at 12/4/15 4:47 PM: - The same thing happens when writing partitioned files with coalesce(1) and spark.speculation set to "false". I've tried multiple different hadoop versions (2.6 -> 2.7.1), multiple writing modes (Overwrite, append, ...), but the result is always almost the same. {code:java} DataFrameWriter writer = df.coalesce(1).write().format("parquet"); writer = perTemplate ? writer.partitionBy("templateId", "definitionId") : writer.partitionBy("definitionId"); writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location); {code} {noformat} [2015-12-04 16:14:59,821] WARN .apache.hadoop.fs.FileUtil [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or dir [/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]: it still exists. [2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job. 
java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918; isDirectory=true; modification_time=1449245693000; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false} to file:/data/build/result_12349.parquet.stats/templateId=2918 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) {noformat} was (Author: xjrk): The same thing happens when writing partitioned files with coalesce(1) and spark.speculation set to "false": {code:java} DataFrameWriter writer = df.coalesce(1).write().format("parquet"); writer = perTemplate ? writer.partitionBy("templateId", "definitionId") : writer.partitionBy("definitionId"); writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location); {code} {noformat} [2015-12-04 16:14:59,821] WARN .apache.hadoop.fs.FileUtil [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or dir [/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]: it still exists. [2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job. 
java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918; isDirectory=true; modification_time=1449245693000; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false} to file:/data/build/result_12349.parquet.stats/templateId=2918 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) {noformat} > _temporary may be left undeleted when a write job committed with > FileOutputCommitter fails due to a race condition > -- > > Key: SPARK-8513 > URL: https://issues.apache.org/jira/browse/SPARK-8513 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Cheng Lian > > To reproduce this issue, we need a node with relatively more cores, say 32 > (e.g., Spark Jenkins builder is a good candidate). With such a node, the > following code should be relatively easy to reproduce this issue: > {code} >
[jira] [Created] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame
Michael Lawrence created SPARK-12148: Summary: SparkR: rename DataFrame to SparkDataFrame Key: SPARK-12148 URL: https://issues.apache.org/jira/browse/SPARK-12148 Project: Spark Issue Type: Wish Components: R Reporter: Michael Lawrence Priority: Minor The SparkR package represents a Spark DataFrame with the class "DataFrame". That conflicts with the more general DataFrame class defined in the S4Vectors package. Would it not be more appropriate to use the name "SparkDataFrame" instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041683#comment-15041683 ] Rares Mirica commented on SPARK-12147: -- I would also like to object to setting this as minor; this is a massive improvement in usability of Spark in multi-tenant environments or interactive-use environments where a large number of executors is needed to prepare an RDD for later use (e.g. exploratory research) and caching is needed to avoid resource waste. The only alternative is to permanently persist the RDD, the API for which is quite a bit more complicated and also puts the responsibility of cleaning and maintaining the data on the shoulders of the user (instead of treating the data as ephemeral and only available for the lifetime of the current application). > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica >Priority: Minor > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. > Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. 
> Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742 ] Jiri Syrovy commented on SPARK-8513: The same thing happens when writing partitioned files with coalesce(1) and spark.speculation set to "false": {code:java} DataFrameWriter writer = df.coalesce(1).write().format("parquet"); writer = perTemplate ? writer.partitionBy("templateId", "definitionId") : writer.partitionBy("definitionId"); writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location); {code} {noformat} [2015-12-04 16:14:59,821] WARN .apache.hadoop.fs.FileUtil [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or dir [/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]: it still exists. [2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job. 
java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918; isDirectory=true; modification_time=1449245693000; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false} to file:/data/build/result_12349.parquet.stats/templateId=2918 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) {noformat} > _temporary may be left undeleted when a write job committed with > FileOutputCommitter fails due to a race condition > -- > > Key: SPARK-8513 > URL: https://issues.apache.org/jira/browse/SPARK-8513 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Cheng Lian > > To reproduce this issue, we need a node with relatively more cores, say 32 > (e.g., Spark Jenkins builder is a good candidate). With such a node, the > following code should be relatively easy to reproduce this issue: > {code} > sqlContext.range(0, 10).repartition(32).select('id / > 0).write.mode("overwrite").parquet("file:///tmp/foo") > {code} > You may observe similar log lines as below: > {noformat} > 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite > WARN FileUtil: Failed to delete file or dir > [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: > it still exists. 
> {noformat} > The reason is that, for a Spark job with multiple tasks, when a task fails > after multiple retries, the job gets canceled on driver side. At the same > time, all child tasks of this job also get canceled. However, task > cancellation is asynchronous. This means some tasks may still be running > when the job is already killed on driver side. > With this in mind, the following execution order may cause the log line > mentioned above: > # Job {{A}} spawns 32 tasks to write the Parquet file > Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a > temporary directory {{D1}} is created to hold output files of different task > attempts. > # Task {{a1}} fails first after several retries because of the division by > zero error > # Task {{a1}} aborts the Parquet write task and tries to remove its task > attempt output directory {{d1}} (a sub-directory of {{D1}}) > # Job {{A}} gets canceled on driver side, and all the other 31 tasks also get > canceled *asynchronously* > # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first > removing all its child files/directories > Note that when testing with a local directory, {{RawLocalFileSystem}} simply > calls {{java.io.File.delete()}} for deletion, and only empty directories can > be deleted. > # Because tasks are canceled asynchronously, some other task, say {{a2}}, may > just get scheduled and create its own task attempt directory {{d2}} under > {{D1}} > # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} > itself, but fails because {{d2}} makes {{D1}} non-empty again > Notice that this bug affects
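The race described above ultimately comes down to plain directory-deletion semantics: like {{java.io.File.delete()}}, a bare directory removal refuses to touch a non-empty directory. A minimal Python sketch of that mechanism (the names D1/d2 mirror the description; this is an illustration, not Spark code):

```python
import os
import tempfile

# D1 plays the role of the job's _temporary output directory.
base = tempfile.mkdtemp()
d1 = os.path.join(base, "D1")
os.mkdir(d1)

# abortJob() has already emptied D1... but before it can remove D1
# itself, a late-scheduled task creates its attempt directory d2 under D1.
d2 = os.path.join(d1, "d2")
os.mkdir(d2)

# rmdir, like java.io.File.delete(), only removes empty directories,
# so the final removal of D1 fails and _temporary is left behind.
try:
    os.rmdir(d1)
    removed = True
except OSError:
    removed = False

print(removed)  # False: D1 survives because d2 made it non-empty again
```

The same sequence with real task threads is nondeterministic, which is why the log line only shows up occasionally on many-core machines.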
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041740#comment-15041740 ] Sean Owen commented on SPARK-12147: --- To be honest, the priority doesn't matter much at all. It won't make anyone work on it. You're talking about a feature that's not generally used in Spark (Tachyon) and may be moving out of the core, so I dunno, it doesn't seem "Major". I understand that this is about RDDs that are cached off-heap only now. You're talking about an executor being shut down, not the app terminating right? I would not think an executor stopping changes off-heap cached data, but I don't know the details -- maybe there is a locality aspect to it, and if there is then that's the reason its fate is like that of on-heap storage. To double check, are you saying you observe this happening or this is just your reading of the docs? the passage you quote is about something else as far as I can tell. > Off heap storage and dynamicAllocation operation > > > Key: SPARK-12147 > URL: https://issues.apache.org/jira/browse/SPARK-12147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 > Environment: Cloudera Hadoop 2.6.0-cdh5.4.8 > Tachyon 0.7.1 > Yarn >Reporter: Rares Mirica >Priority: Minor > Attachments: spark-defaults.conf > > > For the purpose of increasing computation density and efficiency I set up to > test off-heap storage (using Tachyon) with dynamicAllocation enabled. > Following the available documentation (programming-guide for Spark 1.5.2) I > was expecting data to be cached in Tachyon for the lifetime of the > application (driver instance) or until unpersist() is called. This belief was > supported by the doc: "Cached data is not lost if individual executors > crash." where with crash I also assimilate Graceful Decommission. 
> Furthermore, in the GD description documented in the job-scheduling document > cached data preservation through off-heap storage is also hinted at. > Seeing how Tachyon is now in a state where these promises of a better future > are well within reach, I consider it a bug that upon graceful decommission of > an executor the off-heap data is deleted (presumably as part of the cleanup > phase). > Needless to say, enabling the preservation of the off-heap persisted data > after graceful decommission for dynamic allocation would yield significant > improvements in resource allocation, especially over yarn where executors use > up compute "slots" even if idle. After a long, expensive, computation where > we take advantage of the dynamically scaled executors, the rest of the spark > jobs can use the cached data while releasing the compute resources for other > cluster tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041742#comment-15041742 ] Jiri Syrovy edited comment on SPARK-8513 at 12/4/15 4:45 PM: - The same thing happens when writing partitioned files with coalesce(1) and spark.speculation set to "false": {code:java} DataFrameWriter writer = df.coalesce(1).write().format("parquet"); writer = perTemplate ? writer.partitionBy("templateId", "definitionId") : writer.partitionBy("definitionId"); writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location); {code} {noformat} [2015-12-04 16:14:59,821] WARN .apache.hadoop.fs.FileUtil [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or dir [/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]: it still exists. [2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job. 
java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918; isDirectory=true; modification_time=1449245693000; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false} to file:/data/build/result_12349.parquet.stats/templateId=2918 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) {noformat} was (Author: xjrk): The same thing happens when writing partitioned files with coalesce(1) and spark.speculation set to "false": {code:java} DataFrameWriter writer = df.coalesce(1).write().format("parquet"); writer = perTemplate ? writer.partitionBy("templateId", "definitionId") : writer.partitionBy("definitionId"); writer.mode(SaveMode.Append).save(ConfConsts.STORAGE_PREFIX + location); {code:java} {noformat} [2015-12-04 16:14:59,821] WARN .apache.hadoop.fs.FileUtil [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Failed to delete file or dir [/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918/definitionId=4/.part-r-0-cee32e28-fa7c-43ec-bbe1-63b639deb395.snappy.parquet.crc]: it still exists. [2015-12-04 16:14:59,821] ERROR InsertIntoHadoopFsRelation [] [akka://JobServer/user/context-supervisor/CSCONTEXT] - Aborting job. 
java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/data/build/result_12349.parquet.stats/_temporary/0/task_201512041614_5493_m_00/templateId=2918; isDirectory=true; modification_time=1449245693000; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false} to file:/data/build/result_12349.parquet.stats/templateId=2918 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.renameOrMerge(FileOutputCommitter.java:397) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:388) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) {noformat} > _temporary may be left undeleted when a write job committed with > FileOutputCommitter fails due to a race condition > -- > > Key: SPARK-8513 > URL: https://issues.apache.org/jira/browse/SPARK-8513 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Cheng Lian > > To reproduce this issue, we need a node with relatively more cores, say 32 > (e.g., Spark Jenkins builder is a good candidate). With such a node, the > following code should be relatively easy to reproduce this issue: > {code} > sqlContext.range(0, 10).repartition(32).select('id / > 0).write.mode("overwrite").parquet("file:///tmp/foo") > {code} > You may observe similar log
[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042103#comment-15042103 ] Felix Cheung commented on SPARK-12144: -- +1 [~shivaram] The style {code} read.format("json").option(options).load("people.json") {code} doesn't really fit R. Instead, why don't we have {code} read(format, options_named_list, ...) {code} In fact, we could keep/add convenience functions like {code} read.csv(options) read.json(options) read.parquet(options) {code} > Implement DataFrameReader and DataFrameWriter API in SparkR > --- > > Key: SPARK-12144 > URL: https://issues.apache.org/jira/browse/SPARK-12144 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Sun Rui > > DataFrameReader API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader > DataFrameWriter API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12038) Add support for pickle protocol to graphite sink
[ https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042124#comment-15042124 ] Apache Spark commented on SPARK-12038: -- User 'alberskib' has created a pull request for this issue: https://github.com/apache/spark/pull/10148 > Add support for pickle protocol to graphite sink > > > Key: SPARK-12038 > URL: https://issues.apache.org/jira/browse/SPARK-12038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bartlomiej Alberski >Priority: Minor > > Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not > guarantee that graphite will receive metrics. On the other hand, usage of the TCP > sender could be problematic in case several jobs are sending a lot of > different metrics with high frequency - it could lead to network congestion. > I think that implementing support for the pickle protocol for sending metrics to > graphite could be helpful in minimizing network traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
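For context on the proposal: Graphite's pickle listener (port 2004 by default) accepts a length-prefixed pickled list of (path, (timestamp, value)) tuples, which lets a sender batch many metrics into a single payload instead of one plain-text line each. A rough Python sketch of building such a payload (illustrative only; not the Spark sink implementation, and the metric names are made up):

```python
import pickle
import struct
import time

def graphite_pickle_payload(metrics):
    """Batch (path, value) pairs into one pickle-protocol payload.

    Graphite's pickle receiver expects a 4-byte big-endian length
    header followed by a pickled list of (path, (timestamp, value)).
    """
    now = int(time.time())
    tuples = [(path, (now, value)) for path, value in metrics]
    body = pickle.dumps(tuples, protocol=2)
    header = struct.pack("!L", len(body))
    return header + body

payload = graphite_pickle_payload([
    ("spark.driver.jvm.heap.used", 12345678),
    ("spark.executor.1.threadpool.activeTasks", 8),
])
# The first 4 bytes encode the body length; one TCP write sends the batch.
```

Compared with the plaintext protocol, this amortizes the per-metric overhead, which is the traffic reduction the description is after.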
[jira] [Assigned] (SPARK-12038) Add support for pickle protocol to graphite sink
[ https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12038: Assignee: (was: Apache Spark) > Add support for pickle protocol to graphite sink > > > Key: SPARK-12038 > URL: https://issues.apache.org/jira/browse/SPARK-12038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bartlomiej Alberski >Priority: Minor > > Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not > guarantee that graphite will receive metrics. On the other hand, usage of the TCP > sender could be problematic in case several jobs are sending a lot of > different metrics with high frequency - it could lead to network congestion. > I think that implementing support for the pickle protocol for sending metrics to > graphite could be helpful in minimizing network traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12038) Add support for pickle protocol to graphite sink
[ https://issues.apache.org/jira/browse/SPARK-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12038: Assignee: Apache Spark > Add support for pickle protocol to graphite sink > > > Key: SPARK-12038 > URL: https://issues.apache.org/jira/browse/SPARK-12038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bartlomiej Alberski >Assignee: Apache Spark >Priority: Minor > > Currently GraphiteSink supports a TCP sender and a UDP sender. The UDP sender does not > guarantee that graphite will receive metrics. On the other hand, usage of the TCP > sender could be problematic in case several jobs are sending a lot of > different metrics with high frequency - it could lead to network congestion. > I think that implementing support for the pickle protocol for sending metrics to > graphite could be helpful in minimizing network traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11944: Assignee: Apache Spark > Python API for mllib.clustering.BisectingKMeans > --- > > Key: SPARK-11944 > URL: https://issues.apache.org/jira/browse/SPARK-11944 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Add Python API for mllib.clustering.BisectingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11944: Assignee: (was: Apache Spark) > Python API for mllib.clustering.BisectingKMeans > --- > > Key: SPARK-11944 > URL: https://issues.apache.org/jira/browse/SPARK-11944 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add Python API for mllib.clustering.BisectingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042339#comment-15042339 ] Apache Spark commented on SPARK-11944: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10150 > Python API for mllib.clustering.BisectingKMeans > --- > > Key: SPARK-11944 > URL: https://issues.apache.org/jira/browse/SPARK-11944 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add Python API for mllib.clustering.BisectingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four
[ https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042381#comment-15042381 ] Apache Spark commented on SPARK-12152: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10151 > Speed up Scalastyle by only running one SBT command instead of four > --- > > Key: SPARK-12152 > URL: https://issues.apache.org/jira/browse/SPARK-12152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > dev/scalastyle runs four SBT commands when only one would suffice. We should > fix this in order to speed up pull request builds by about 60 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four
[ https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12152: Assignee: Apache Spark (was: Josh Rosen) > Speed up Scalastyle by only running one SBT command instead of four > --- > > Key: SPARK-12152 > URL: https://issues.apache.org/jira/browse/SPARK-12152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Josh Rosen >Assignee: Apache Spark > > dev/scalastyle runs four SBT commands when only one would suffice. We should > fix this in order to speed up pull request builds by about 60 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1832) Executor UI improvement suggestions
[ https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041936#comment-15041936 ] Alexander Bozarth edited comment on SPARK-1832 at 12/4/15 8:49 PM: --- I've split off a subtask for the Color improvements that I will be submitting a PR for shortly. Also what do you mean by "the MASTER task" in the description? I will also continue to work on the column totals task was (Author: ajbozarth): I've split off a subtask for the Color improvements that I will be submitting a PR for shortly. Also what do you mean by "the MASTER task" in the description? > Executor UI improvement suggestions > --- > > Key: SPARK-1832 > URL: https://issues.apache.org/jira/browse/SPARK-1832 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > I received some suggestions from a user for the /executors UI page to make it > more helpful. This gets more important when you have a really large number of > executors. > Fill some of the cells with color in order to make it easier to absorb > the info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed) > - if dark blue then write the value in white (same for the RED and GREEN above > Maybe mark the MASTER task somehow > > Report the TOTALS in each column (do this at the TOP so no need to scroll > to the bottom, or print both at top and bottom). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6990) Add Java linting script
[ https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6990. --- Resolution: Fixed Fix Version/s: 2.0.0 Fixed by https://github.com/apache/spark/pull/9867 > Add Java linting script > --- > > Key: SPARK-6990 > URL: https://issues.apache.org/jira/browse/SPARK-6990 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Josh Rosen >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > It would be nice to add a {{dev/lint-java}} script to enforce style rules for > Spark's Java code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four
[ https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12152: Assignee: Josh Rosen (was: Apache Spark) > Speed up Scalastyle by only running one SBT command instead of four > --- > > Key: SPARK-12152 > URL: https://issues.apache.org/jira/browse/SPARK-12152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > dev/scalastyle runs four SBT commands when only one would suffice. We should > fix this in order to speed up pull request builds by about 60 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12150: Assignee: (was: Apache Spark) > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
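The change being asked for amounts to giving numPartitions a default so that callers who only want a custom step never have to think about partitioning. A toy Python analogue of the suggested signature (the function name range_df and the fallback default of 1 are hypothetical, purely to show the shape; the real API would fall back to a context default):

```python
def range_df(start, end, step=1, num_partitions=None):
    """Build a partitioned range; num_partitions is optional.

    When num_partitions is None, fall back to a default (here, 1),
    mirroring how sparkContext.range() treats it as optional.
    """
    if num_partitions is None:
        num_partitions = 1  # stand-in for defaultParallelism
    values = list(range(start, end, step))
    # Chunk values into num_partitions roughly equal partitions.
    size = max(1, -(-len(values) // num_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

# A caller can now specify a step without specifying partitioning:
parts = range_df(0, 10, 2)
```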
[jira] [Commented] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042194#comment-15042194 ] Apache Spark commented on SPARK-12150: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10149 > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12150: Assignee: Apache Spark > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Assignee: Apache Spark >Priority: Minor > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11255) R Test build should run on R 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11255: --- Assignee: shane knapp > R Test build should run on R 3.1.1 > -- > > Key: SPARK-11255 > URL: https://issues.apache.org/jira/browse/SPARK-11255 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Assignee: shane knapp >Priority: Minor > > Test should run on R 3.1.1 which is the version listed as supported. > Apparently there are few R changes that can go undetected since Jenkins Test > build is running something newer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four
Josh Rosen created SPARK-12152: -- Summary: Speed up Scalastyle by only running one SBT command instead of four Key: SPARK-12152 URL: https://issues.apache.org/jira/browse/SPARK-12152 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen Assignee: Josh Rosen dev/scalastyle runs four SBT commands when only one would suffice. We should fix this in order to speed up pull request builds by about 60 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12058) Fix KinesisStreamTests in python/pyspark/streaming/tests.py and enable it
[ https://issues.apache.org/jira/browse/SPARK-12058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12058. --- Resolution: Fixed Fix Version/s: 1.6.0 > Fix KinesisStreamTests in python/pyspark/streaming/tests.py and enable it > - > > Key: SPARK-12058 > URL: https://issues.apache.org/jira/browse/SPARK-12058 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > KinesisStreamTests is disabled to unblock other PRs. After fixing it, we > should enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12138: - Summary: Escape \u in the generated comments. (was: [SPARK-11352][SQL] Escape \u in the generated comments.) > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042260#comment-15042260 ] shane knapp commented on SPARK-11823: - it looks like it's this particular test that's hanging (and leaving hanging processes): [info] - test jdbc cancel *** FAILED *** (1 hour, 27 minutes, 37 seconds) i've seen this process hanging for over 3 hours as well. > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-12148: - Component/s: SparkR > SparkR: rename DataFrame to SparkDataFrame > -- > > Key: SPARK-12148 > URL: https://issues.apache.org/jira/browse/SPARK-12148 > Project: Spark > Issue Type: Wish > Components: R, SparkR >Reporter: Michael Lawrence >Priority: Minor > > The SparkR package represents a Spark DataFrame with the class "DataFrame". > That conflicts with the more general DataFrame class defined in the S4Vectors > package. Would it not be more appropriate to use the name "SparkDataFrame" > instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12151) Improve PySpark MLLib prediction performance when using pickled vectors
holdenk created SPARK-12151: --- Summary: Improve PySpark MLLib prediction performance when using pickled vectors Key: SPARK-12151 URL: https://issues.apache.org/jira/browse/SPARK-12151 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: holdenk Priority: Minor In a number of places inside PySpark MLLib, when calling predict on an RDD, we map the Python prediction function over the RDD; instead, we could convert the RDD to an RDD of pickled Vectors and then use the Java prediction function. This would be useful for models which have optimized prediction on batches of objects (e.g. by broadcasting the relevant parts of the model or similar). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
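The batching idea in the issue can be illustrated with a toy model: mapping a per-element Python predict crosses the Python/JVM boundary once per element, while a batch predict ships the whole (pickled) collection across once. The class and method names below are hypothetical stand-ins, not the MLlib API:

```python
class ToyModel:
    """Stand-in for an MLlib model wrapper (hypothetical)."""

    def __init__(self, weight):
        self.weight = weight
        self.calls = 0  # counts boundary crossings (e.g. Py4J round trips)

    def predict(self, x):
        # One crossing per element when mapped over a collection.
        self.calls += 1
        return self.weight * x

    def predict_batch(self, xs):
        # One crossing for the whole batch, analogous to converting the
        # RDD to pickled Vectors and calling the Java predict once.
        self.calls += 1
        return [self.weight * x for x in xs]

model = ToyModel(weight=2)
per_element = [model.predict(x) for x in [1, 2, 3]]   # 3 crossings
batched = model.predict_batch([1, 2, 3])              # 1 more crossing
# Same results either way; far fewer crossings in the batched path.
```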
[jira] [Commented] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042156#comment-15042156 ] Xiao Li commented on SPARK-12150: - Yeah, you are right. Will deliver a PR soon. Thanks! > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
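The requested change can be illustrated with a hypothetical Python signature (names and the default of 4 are illustrative, not Spark's actual implementation): `num_partitions` defaults to `None` and falls back to a context-level default, so specifying `step` no longer forces the caller to pick a partition count:

```python
DEFAULT_PARALLELISM = 4  # stand-in for the context's default parallelism

def sql_range(start, end=None, step=1, num_partitions=None):
    # Single-argument form: sql_range(n) counts from 0 to n, like the
    # existing convenience overloads.
    if end is None:
        start, end = 0, start
    if num_partitions is None:
        num_partitions = DEFAULT_PARALLELISM
    return list(range(start, end, step)), num_partitions

values, parts = sql_range(0, 10, 3)   # step given, partitions defaulted
assert values == [0, 3, 6, 9] and parts == 4
```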
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042482#comment-15042482 ] Evan Chen commented on SPARK-10931: --- Hey Joseph, Just wanted to give a status update. I should be submitting a PR soon. Thanks, > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
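The proposed fix, `Estimator.fit` copying its Param values onto the returned Model, reduces to this pattern (a minimal sketch; the class shapes and the `param_map` attribute are hypothetical, not PySpark's real API):

```python
class Model:
    def __init__(self):
        self.param_map = {}   # wrapper around a Java object; starts empty

class Estimator:
    def __init__(self, **params):
        self.param_map = dict(params)

    def fit(self, dataset):
        model = Model()
        # The fix: copy the estimator's Param values onto the fitted model
        # so they can be inspected on the Python side.
        model.param_map = dict(self.param_map)
        return model

lr = Estimator(maxIter=10, regParam=0.01)
model = lr.fit(dataset=None)
assert model.param_map["maxIter"] == 10
```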
[jira] [Resolved] (SPARK-12142) Can't request executor when container allocator is not ready
[ https://issues.apache.org/jira/browse/SPARK-12142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12142. Resolution: Fixed Assignee: meiyoula Fix Version/s: 2.0.0 > Can't request executor when container allocator is not ready > > > Key: SPARK-12142 > URL: https://issues.apache.org/jira/browse/SPARK-12142 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula >Assignee: meiyoula > Fix For: 2.0.0 > > > Using Dynamic Allocation function, when a new AM is starting, and > ExecutorAllocationManager send RequestExecutor message to AM. If the > container allocator is not ready, the whole app will hang on -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
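One way to avoid the hang, buffering executor requests until the allocator has registered, can be sketched as follows (an illustrative guard only, not necessarily the change that was merged; all names are made up):

```python
class AMEndpoint:
    def __init__(self):
        self.allocator_ready = False
        self.pending_requests = []

    def request_total_executors(self, n):
        # If the container allocator isn't ready yet, remember the request
        # instead of silently dropping it, which is what leaves the app hung.
        if not self.allocator_ready:
            self.pending_requests.append(n)
            return False
        return True

    def on_allocator_ready(self):
        self.allocator_ready = True
        # Replay whatever arrived during startup.
        return list(self.pending_requests)

am = AMEndpoint()
assert am.request_total_executors(8) is False   # buffered, not lost
assert am.on_allocator_ready() == [8]           # replayed once ready
```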
[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042492#comment-15042492 ] Xiao Li commented on SPARK-12138: - If nobody takes it, I can make a try. [~yhuai] Could you explain how to reproduce it? I did not hit this issue when running the tests. Thanks > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12080) Kryo - Support multiple user registrators
[ https://issues.apache.org/jira/browse/SPARK-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12080. Resolution: Fixed Assignee: Rotem Fix Version/s: 2.0.0 > Kryo - Support multiple user registrators > - > > Key: SPARK-12080 > URL: https://issues.apache.org/jira/browse/SPARK-12080 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Rotem >Assignee: Rotem >Priority: Minor > Labels: kryo, registrator, serializers > Fix For: 2.0.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Background: Currently when users need to have a custom serializer for their > registered classes, they use the user registrator of Kryo using the > spark.kryo.registrator configuration parameter. > Problem: If the Spark user is an infrastructure itself, it may receive > multiple such registrators but won't be able to register them. > Important note: Currently the single registrator supported can't reach any > state/configuration (it is instantiated by reflection with empty constructor) > Using SparkEnv from user code isn't acceptable. > Workaround: > Create a wrapper registrator as a user, and have its implementation scan the > class path for multiple classes. > Caveat: this is inefficient and too complicated. > Suggested solution - support multiple registrators + stay backward compatible > Option 1: > enhance the value of spark.kryo.registrator to support a comma separated > list for class names. This will be backward compatible and won't add new > parameters. > Option 2: > to be more logical, add spark.kryo.registrators new parameter, while keeping > the code handling the old one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
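Option 1 above (a comma-separated list in `spark.kryo.registrator`) boils down to a parse step like this sketch (Python used for illustration; the real code lives on the Scala side, and the class names below are made up):

```python
def parse_registrators(conf_value):
    """Split a comma-separated registrator setting into class names,
    tolerating whitespace and empty entries for backward compatibility."""
    if not conf_value:
        return []
    return [name.strip() for name in conf_value.split(",") if name.strip()]

# A single class name keeps working unchanged (backward compatible):
assert parse_registrators("com.example.MyRegistrator") == ["com.example.MyRegistrator"]
# Multiple registrators become possible:
assert parse_registrators("com.a.Reg1, com.b.Reg2") == ["com.a.Reg1", "com.b.Reg2"]
```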
[jira] [Resolved] (SPARK-12084) Fix codes that uses ByteBuffer.array incorrectly
[ https://issues.apache.org/jira/browse/SPARK-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12084. Resolution: Fixed Fix Version/s: 2.0.0 > Fix codes that uses ByteBuffer.array incorrectly > > > Key: SPARK-12084 > URL: https://issues.apache.org/jira/browse/SPARK-12084 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > ByteBuffer doesn't guarantee all contents in `ByteBuffer.array` are valid. > E.g, a ByteBuffer returned by ByteBuffer.slice. We should not use the whole > content of `ByteBuffer` unless we know that's correct. > This patch fixed all places that use `ByteBuffer.array` incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
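The pitfall generalizes beyond Java: a sliced buffer view still reports the full backing storage. Python's `memoryview` shows the same shape of bug (an analogy, not Spark code):

```python
buf = bytearray(b"hello world")
view = memoryview(buf)[6:]   # like ByteBuffer.slice(): a window, not a copy

# Correct: read through the view, which sees only the slice.
assert bytes(view) == b"world"

# Incorrect analogue of ByteBuffer.array(): the underlying object is the
# whole buffer, including bytes outside the slice.
assert bytes(view.obj) == b"hello world"
```

Code that grabs the backing array must also honor the view's offset and length, or, as in the Spark patch, avoid the backing array entirely unless it is known to be valid.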
[jira] [Commented] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding
[ https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042516#comment-15042516 ] Matt Cheah commented on SPARK-11081: Upgrading to Jersey 2 definitely sounds more reasonable. Perhaps we can discard this ticket and just open a "Upgrade to Jersey 2" ticket targeting Spark 2.0, what does everyone think? > Make spark-core pull in Jersey and javax.ws.rs dependencies separately for > easier overriding > > > Key: SPARK-11081 > URL: https://issues.apache.org/jira/browse/SPARK-11081 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Reporter: Mingyu Kim > > As seen from this thread > (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E), > Spark is incompatible with Jersey 2 especially when Spark is embedded in an > application running with Jersey. > There was an in-depth discussion on options for shading and making it easier > for users to be able to use Jersey 2 with Spark applications: > https://github.com/apache/spark/pull/9615 > To recap the discussion, Jersey 1 has two issues: > 1. It has classes listed in META-INF/services/ files that would be loaded > even if Jersey 2 was being loaded on the classpath in a higher precedence. > This means that Jersey 2 would attempt to use Jersey 1 implementations in > some places regardless of user attempts to override the dependency with > things like userClassPathFirst. > 2. Jersey 1 packages javax.ws.rs classes inside itself, making it hard to > exclude just javax.ws.rs APIs and replace them with ones that Jersey 2 is > compatible with. > Also discussed was the fact that plain old shading doesn't work here, since > you would need to shade lines in META-INF/services as well, not just classes. > Not to mention that shading JAX-RS annotations is tricky as well. > To recap the discussion as what needs to happen Spark-side, we need to: > 1. Create a "org.spark-project.jersey" artifact (loosely speaking) which is > the Jersey 1 jar minus all the javax.ws.rs stuff (no need to actually > shade/namespace the classes that way, just the artifact name) > 2. Put all the javax.ws.rs stuff extracted from step 1 into its own artifact, > say "org.spark-project.javax.ws.rs". (META-INF/services/javax.ws.rs* files > live in this artifact as well) > 3. Spark-core's pom depends on org.spark-project artifacts from step 1 and 2 > 4. Spark assembly excludes META-INF/services/javax.ws.rs.* - it turns out > these files aren't actually necessary for Jersey 1 to function properly in > general (we need to test this more however) > Now a user that wants to depend on Jersey 2, and is depending on Spark maven > artifacts, would do the following in their application > 1. Provide my own dependency on Jersey 2 and its transitive javax.ws.rs > dependencies > 2. In my application's dependencies, exclude org.spark-project.javax.ws.rs > from spark-core. We keep org.spark-project.jersey because spark-core needs > it, but it will use the javax.ws.rs classes that my application is providing. > 3. Set spark.executor.userClassPathFirst=true and ship Jersey 2 and new > javax.ws.rs jars to the executors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding
[ https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042519#comment-15042519 ] Marcelo Vanzin commented on SPARK-11081: Opening a new one or re-purposing this one, either is fine. > Make spark-core pull in Jersey and javax.ws.rs dependencies separately for > easier overriding > > > Key: SPARK-11081 > URL: https://issues.apache.org/jira/browse/SPARK-11081 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Reporter: Mingyu Kim > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11081) Make spark-core pull in Jersey and javax.ws.rs dependencies separately for easier overriding
[ https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-11081. Resolution: Not A Problem Ok, I filed SPARK-12154 to track it. > Make spark-core pull in Jersey and javax.ws.rs dependencies separately for > easier overriding > > > Key: SPARK-11081 > URL: https://issues.apache.org/jira/browse/SPARK-11081 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Reporter: Mingyu Kim > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12112) Upgrade to SBT 0.13.9
[ https://issues.apache.org/jira/browse/SPARK-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12112. - Resolution: Fixed Fix Version/s: 2.0.0 > Upgrade to SBT 0.13.9 > - > > Key: SPARK-12112 > URL: https://issues.apache.org/jira/browse/SPARK-12112 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > We should upgrade to SBT 0.13.9, since this is a requirement in order to use > SBT's new Maven-style resolution features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042539#comment-15042539 ] Yin Huai commented on SPARK-12138: -- Yeah. It will be great if you can work on it. To reproduce it, you can try it like https://github.com/apache/spark/commit/5872a9d89fe2720c2bcb1fc7494136947a72581c#diff-cf187b40d98ff322d4bde4185701baa2. Basically you have a predicate and one of the operand is {{\u}}. > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
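The failure mode: javac processes `\u` unicode escapes before tokenizing, so a stray `\u` inside a generated comment is a compile error. The fix amounts to breaking up the backslash-u pair before emitting the comment text. A minimal sketch (the helper name is made up, and the real codegen fix may escape more than this):

```python
def escape_comment_text(text):
    # Double the backslash so that "\u" can never reach javac as a
    # (possibly malformed) unicode escape inside the generated comment.
    return text.replace("\\u", "\\\\u")

# A predicate printed into a codegen comment, containing a literal \u:
comment = escape_comment_text("input[0] = \\uABCD")
assert "\\\\u" in comment           # safe to embed in generated Java now
```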
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042438#comment-15042438 ] Justin Bailey commented on SPARK-8118: -- Can this issue be re-opened? It's really terrible how much log output Parquet produces. > Turn off noisy log output produced by Parquet 1.7.0 > --- > > Key: SPARK-8118 > URL: https://issues.apache.org/jira/browse/SPARK-8118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Parquet 1.7.0 renames package name to "org.apache.parquet", need to adjust > {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. > A better approach than simply muting these log lines is to redirect Parquet > logs via SLF4J, so that we can handle them consistently. In general these > logs are very useful. Esp. when used to diagnosing Parquet memory issue and > filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042438#comment-15042438 ] Justin Bailey edited comment on SPARK-8118 at 12/4/15 11:48 PM: Can this issue be re-opened? It's really terrible how much log output Parquet produces. (Using spark 1.5.1, btw) was (Author: m4dc4p): Can this issue be re-opened? It's really terrible how much log output Parquet produces. > Turn off noisy log output produced by Parquet 1.7.0 > --- > > Key: SPARK-8118 > URL: https://issues.apache.org/jira/browse/SPARK-8118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Parquet 1.7.0 renames package name to "org.apache.parquet", need to adjust > {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. > A better approach than simply muting these log lines is to redirect Parquet > logs via SLF4J, so that we can handle them consistently. In general these > logs are very useful. Esp. when used to diagnosing Parquet memory issue and > filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible
YongGang Cao created SPARK-12153: Summary: Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible Key: SPARK-12153 URL: https://issues.apache.org/jira/browse/SPARK-12153 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.2 Reporter: YongGang Cao Priority: Minor sentence boundary matters for sliding window, we shouldn't train model from a window across sentences. the current 100 word as a hard split for sentences doesn't really make sense. And the cosinesimilarity functions is private which is useless for caller. we may need to access the vocabulary and wordindex table as well, those need getters I made changes to address above issues. will send out pull request for your review. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
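The sentence-boundary concern is concrete: a skip-gram window should never pair words from different sentences. A sketch of boundary-respecting window generation (plain Python, illustrative only, not the MLlib implementation):

```python
def context_windows(sentences, window=2):
    """Yield (center, context) pairs without ever crossing a sentence
    boundary, unlike a fixed 100-word chunking of the token stream."""
    pairs = []
    for sent in sentences:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            pairs.append((center, sent[lo:i] + sent[i + 1:hi]))
    return pairs

pairs = context_windows([["the", "cat", "sat"], ["dogs", "bark"]], window=1)
# "sat" and "dogs" are adjacent in the token stream but never share a window:
assert ("sat", ["cat"]) in pairs and ("dogs", ["bark"]) in pairs
```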
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042462#comment-15042462 ] Andrew Philpot commented on SPARK-4036: --- Hi, what is the maturity of this code? Are you interested in a tester? I have existing models and features for CRF++. Would love to simply migrate them to a native spark implementation. > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible
[ https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YongGang Cao updated SPARK-12153: - Description: sentence boundary matters for sliding window, we shouldn't train model from a window across sentences. the current 100 word as a hard split for sentences doesn't really make sense. And the cosinesimilarity functions is private which is useless for caller. we may need to access the vocabulary and wordindex table as well, those need getters I made changes to address above issues. here is the pull request: https://github.com/apache/spark/pull/10152 was: sentence boundary matters for sliding window, we shouldn't train model from a window across sentences. the current 100 word as a hard split for sentences doesn't really make sense. And the cosinesimilarity functions is private which is useless for caller. we may need to access the vocabulary and wordindex table as well, those need getters I made changes to address above issues. will send out pull request for your review. > Word2Vec uses a fixed length for sentences which is not reasonable for > reality, and similarity functions and fields are not accessible > -- > > Key: SPARK-12153 > URL: https://issues.apache.org/jira/browse/SPARK-12153 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.2 >Reporter: YongGang Cao >Priority: Minor > Labels: patch > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042513#comment-15042513 ] Nakul Jindal commented on SPARK-11046: -- Hi, I am trying to look into this. When you say that SparkR passes a DataFrame schema from R to JVM backend using regular expression, do you mean this format: mapor array Also, is "structField.character" the only function where this "regular expression" format is passed from R to JVM (using org.apache.spark.sql.api.r.SQLUtils", "createDF)? > Pass schema from R to JVM using JSON format > --- > > Key: SPARK-11046 > URL: https://issues.apache.org/jira/browse/SPARK-11046 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Sun Rui >Priority: Minor > > Currently, SparkR passes a DataFrame schema from R to JVM backend using > regular expression. However, Spark now supports schmea using JSON format. > So enhance SparkR to use schema in JSON format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
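For reference, the JSON form Spark SQL understands for a struct schema looks roughly like the example below, which is what SparkR would emit instead of the regex-style string (the field list is invented for illustration; consult `StructType.json()` in your Spark version for the authoritative shape):

```python
import json

# An example schema serialized in the StructType-style JSON shape.
schema = {
    "type": "struct",
    "fields": [
        {"name": "age", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
}

payload = json.dumps(schema)    # what the R side would send to the backend
decoded = json.loads(payload)   # what the JVM side would parse back
assert decoded["fields"][0]["name"] == "age"
```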
[jira] [Created] (SPARK-12154) Upgrade to Jersey 2
Matt Cheah created SPARK-12154: -- Summary: Upgrade to Jersey 2 Key: SPARK-12154 URL: https://issues.apache.org/jira/browse/SPARK-12154 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.5.2 Reporter: Matt Cheah Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. Library conflicts for Jersey are difficult to workaround - see discussion on SPARK-11081. It's easier to upgrade Jersey entirely, but we should target Spark 2.0 since this may be a break for users who were using Jersey 1 in their Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042541#comment-15042541 ] Xiao Li commented on SPARK-12138: - Sure. Will try it tonight or tomorrow. Thanks! : ) > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10123) Cannot set "--deploy-mode" in default configuration
[ https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042598#comment-15042598 ] Saisai Shao commented on SPARK-10123: - Just confirm if it is on your plan, in case duplicated :). > Cannot set "--deploy-mode" in default configuration > --- > > Key: SPARK-10123 > URL: https://issues.apache.org/jira/browse/SPARK-10123 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcelo Vanzin >Priority: Minor > > There's no configuration option that is the equivalent of "--deploy-mode". So > it's not possible, for example, to have applications be submitted in > standalone cluster mode by default - you have to always use the command line > argument for that. > YARN is special because it has the (somewhat deprecated) "yarn-cluster" > master, but it would be nice to be consistent and have a proper config option > for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042612#comment-15042612 ] Apache Spark commented on SPARK-12138: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10155 > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12138: Assignee: Apache Spark > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12138: Assignee: (was: Apache Spark) > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > https://spark-tests.appspot.com/test-logs/12683942 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042618#comment-15042618 ] Yin Huai commented on SPARK-12155: -- We can take a look at task {{136}}. {code} 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. But, on-heap execution memory poll only has 196608 bytes free memory.269746176 bytes pool size and 269549568 bytes used memory.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8195014656, storageMemoryPool.poolSize 16659775488, storageRegionSize 8464760832.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool.(taskAttemptId: 136) ... 15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 65536 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size is 269811712 bytes. 262144 bytes are free.(taskAttemptId: 136) 15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 81920, poolSize 269811712, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 136. {code} Task 136 wants to acquire 262144 bytes, but the execution memory pool only has {{196608}} bytes free. So we reclaim the {{65536}}-byte shortfall from the storage memory pool and add it back to the execution pool, which grows to {{269811712}} bytes (it was {{269746176}}). However, {{ExecutionMemoryPool.acquireMemory}} computes the grant with {{val maxToGrant = math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))}}. So the memory that task 136 may use is capped at {{269811712 / 4 = 67452928}} bytes, of which {{67371008}} bytes are already in use, and the free space for this task ends up being {{67452928 - 67371008 = 81920}} bytes.
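For concreteness, here is a minimal standalone sketch (a toy, not Spark's actual {{ExecutionMemoryPool}} class) of the {{maxToGrant}} arithmetic described above, plugged with the values from task 136's log line:

```scala
// Sketch of the per-task grant cap from ExecutionMemoryPool.acquireMemory.
// This is a simplified illustration, not the real Spark implementation.
object MaxToGrantSketch {
  def maxToGrant(numBytes: Long, poolSize: Long, numActiveTasks: Int, curMem: Long): Long =
    math.min(numBytes, math.max(0L, (poolSize / numActiveTasks) - curMem))

  def main(args: Array[String]): Unit = {
    // Values for taskAttemptId 136 after the 65536-byte reclaim from storage:
    val grant = maxToGrant(numBytes = 262144L, poolSize = 269811712L,
                           numActiveTasks = 4, curMem = 67371008L)
    // The per-task cap (poolSize / numActiveTasks = 67452928) minus the
    // 67371008 bytes already held leaves only 81920 bytes of headroom,
    // even though the pool itself has 262144 bytes free after the reclaim.
    println(grant) // 81920
  }
}
```

This shows why reclaiming the shortfall from storage does not help here: the pool-wide free space grows, but the per-task cap {{poolSize / numActiveTasks}} is what limits the grant.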
> Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). > 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. >
[jira] [Created] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
Yin Huai created SPARK-12155: Summary: Execution OOM after a relative large dataset cached in the cluster. Key: SPARK-12155 URL: https://issues.apache.org/jira/browse/SPARK-12155 Project: Spark Issue Type: Bug Components: Spark Core, SQL Reporter: Yin Huai Priority: Blocker I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. When I start to consume the query. I got the following exception (I added more logs to the code). {code} 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for block rdd_94_37(free: 3253659951, max: 16798973952) 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for block rdd_94_37(free: 3252611375, max: 16798973952) 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 3028 bytes result sent to driver 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for block rdd_94_37(free: 3314840375, max: 16866344960) 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for block rdd_94_37(free: 3215892137, max: 16866344960) 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space for block rdd_94_37(free: 3117216424, max: 16866344960) 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space for block rdd_94_37(free: 2919868859, max: 16866344960) 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space for block rdd_94_37(free: 2687050010, max: 16929521664) 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
3028 bytes result sent to driver 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space for block rdd_94_37(free: 2292321531, max: 16929521664) 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space for block rdd_94_37(free: 1701062715, max: 16929521664) 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space for block rdd_94_37(free: 799417533, max: 16929521664) 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would require dropping another block from the same RDD 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in memory! (computed 2.4 GB so far) 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. But, on-heap execution memory poll only has 0 bytes free memory. 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 8464760832. 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool. 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory space from StorageMemoryPool. 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes memory. But, on-heap execution memory poll only has 0 bytes free memory. 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 8464760832. 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool. 
15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory space from StorageMemoryPool. 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 3077 bytes result sent to driver 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132 15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0 (TID 128) 15/12/05 01:20:56 INFO Executor: Running task 13.0 in stage 5.0 (TID 132) 15/12/05 01:20:56 INFO Executor: Running task 5.0 in stage 5.0 (TID 124) 15/12/05 01:20:56 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache 15/12/05 01:20:56 INFO TorrentBroadcast: Started reading broadcast variable 6 15/12/05 01:20:56 INFO MemoryStore: Ensuring 9471 bytes of free
[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042616#comment-15042616 ] Yin Huai commented on SPARK-12155: -- I have some new logs. {code} 15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes 15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. But, on-heap execution memory poll only has 0 bytes free memory.269484032 bytes pool size and 269484032 bytes used memory.(taskAttemptId: 131) 15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes 15/12/05 02:51:33 INFO MemoryConsumer: allocateArray with size 262144 bytes 15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8195276800, storageMemoryPool.poolSize 16660037632, storageRegionSize 8464760832.(taskAttemptId: 131) 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool.(taskAttemptId: 131) 15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 262144 bytes free memory space from StorageMemoryPool. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.(taskAttemptId: 131) 15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size is 269746176 bytes. 262144 bytes are free.(taskAttemptId: 131) 15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 65536, poolSize 269746176, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 131. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. 
But, on-heap execution memory poll only has 196608 bytes free memory.269746176 bytes pool size and 269549568 bytes used memory.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8195014656, storageMemoryPool.poolSize 16659775488, storageRegionSize 8464760832.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool.(taskAttemptId: 136) 15/12/05 02:51:33 INFO TaskMemoryManager: Task 131 acquire 64.0 KB for org.apache.spark.unsafe.map.BytesToBytesMap@118eb11 15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 65536 bytes free memory space from StorageMemoryPool. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 65536 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.(taskAttemptId: 136) 15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size is 269811712 bytes. 262144 bytes are free.(taskAttemptId: 136) 15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 81920, poolSize 269811712, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 136. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. But, on-heap execution memory poll only has 180224 bytes free memory.269811712 bytes pool size and 269631488 bytes used memory.(taskAttemptId: 135) 15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8194949120, storageMemoryPool.poolSize 16659709952, storageRegionSize 8464760832.(taskAttemptId: 135) 15/12/05 02:51:33 INFO TaskMemoryManager: Task 136 acquire 80.0 KB for org.apache.spark.unsafe.map.BytesToBytesMap@2a342d48 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool.(taskAttemptId: 135) 15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 81920 bytes free memory space from StorageMemoryPool. 
15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 81920 bytes of memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.(taskAttemptId: 135) 15/12/05 02:51:33 INFO UnifiedMemoryManager: onHeapExecutionMemoryPool's size is 269893632 bytes. 262144 bytes are free.(taskAttemptId: 135) 15/12/05 02:51:33 INFO ExecutionMemoryPool: maxToGrant 102400, poolSize 269893632, numActiveTasks 4, curMem 67371008, numBytes 262144, taskAttemptId 135. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to acquire 262144 bytes memory. But, on-heap execution memory poll only has 159744 bytes free memory.269893632 bytes pool size and 269733888 bytes used memory.(taskAttemptId: 119) 15/12/05 02:51:33 INFO TaskMemoryManager: Task 135 acquire 100.0 KB for org.apache.spark.unsafe.map.BytesToBytesMap@74e81f25 15/12/05 02:51:33 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 8194867200, storageMemoryPool.poolSize 16659628032, storageRegionSize 8464760832.(taskAttemptId: 119) 15/12/05 02:51:33 INFO UnifiedMemoryManager: Try to reclaim memory space from storage memory pool.(taskAttemptId: 119) 15/12/05 02:51:33 INFO StorageMemoryPool: Claiming 102400 bytes free memory space from StorageMemoryPool. 15/12/05 02:51:33 INFO UnifiedMemoryManager: Reclaimed 102400 bytes of memory from storage memory pool.Adding
[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042582#comment-15042582 ] Apache Spark commented on SPARK-12155: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/10153 > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). > 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. 
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). > 3077 bytes result sent to driver > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 > 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got
[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12155: Assignee: (was: Apache Spark) > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). > 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. 
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). > 3077 bytes result sent to driver > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 > 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132 > 15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0 (TID 128) >
[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12155: Assignee: Apache Spark > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). > 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. 
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). > 3077 bytes result sent to driver > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 > 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132 > 15/12/05 01:20:56 INFO Executor: Running task 9.0 in
[jira] [Commented] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042605#comment-15042605 ] Apache Spark commented on SPARK-12149: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/10154 > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alexander Bozarth > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed) > if dark blue then write the value in white (same for the RED and GREEN above > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12149: Assignee: Apache Spark > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alexander Bozarth >Assignee: Apache Spark > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed) > if dark blue then write the value in white (same for the RED and GREEN above > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12149: Assignee: (was: Apache Spark) > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alexander Bozarth > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed)) > if dark blue then write the value in white (same for the RED and GREEN above) > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042681#comment-15042681 ] Jean-Baptiste Onofré commented on SPARK-12140: -- Just checking whether you've already started on something ;) > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used.
[jira] [Assigned] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12102: Assignee: (was: Apache Spark) > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis.
[jira] [Commented] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042717#comment-15042717 ] Apache Spark commented on SPARK-12102: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/10156 > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis.
[jira] [Assigned] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12102: Assignee: Apache Spark > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis.
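The fix suggested above amounts to unifying nullability across the CASE branches during analysis. A minimal sketch of that coercion idea (the class and method names here are hypothetical illustrations, not code from Spark's analyzer): when two struct types agree on field data types but differ in field nullability, a field in the unified type is nullable whenever it is nullable in either branch.

```java
import java.util.Arrays;

public class NullabilityUnify {
    // Hypothetical helper: given per-field nullability flags of two struct
    // types with identical field data types, compute the unified flags.
    // A field is nullable in the result if it is nullable in either input.
    static boolean[] unifyNullability(boolean[] left, boolean[] right) {
        boolean[] unified = new boolean[left.length];
        for (int i = 0; i < left.length; i++) {
            unified[i] = left[i] || right[i];
        }
        return unified;
    }

    public static void main(String[] args) {
        // struct(1, 2, 3, 4): all fields non-nullable
        boolean[] elseBranch = {false, false, false, false};
        // struct(1, 2, 3, cast(hash(4) as int)): last field nullable
        boolean[] thenBranch = {false, false, false, true};
        // Unified type makes the last field nullable, so both branches fit it.
        System.out.println(Arrays.toString(unifyNullability(thenBranch, elseBranch)));
    }
}
```

With this widening, both branches can be cast to the common type instead of failing the "same type or coercible to a common type" check.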
[jira] [Created] (SPARK-12149) Executor UI improvement suggestions - Color UI
Alexander Bozarth created SPARK-12149: - Summary: Executor UI improvement suggestions - Color UI Key: SPARK-12149 URL: https://issues.apache.org/jira/browse/SPARK-12149 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Alexander Bozarth Splitting off the Color UI portion of the parent UI improvements task, description copied below: Fill some of the cells with color in order to make it easier to absorb the info, e.g. RED if Failed Tasks greater than 0 (maybe the more failed, the more intense the red) GREEN if Active Tasks greater than 0 (maybe more intense the larger the number) Possibly color code COMPLETE TASKS using various shades of blue (e.g., based on the log(# completed)) if dark blue then write the value in white (same for the RED and GREEN above) Merging another idea from SPARK-2132: Color GC time red when over a percentage of task time
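The intensity idea in the description can be sketched as a tiny helper that maps a failed-task count to a red shade whose opacity grows with the log of the count. Everything below is illustrative: the function name, the rgba encoding, and the saturation point are assumptions, not code from the Spark Web UI.

```java
import java.util.Locale;

public class CellShade {
    // Hypothetical helper: 0 failures -> no highlight; otherwise a red whose
    // opacity follows log10 of the count, saturating at 999 failures.
    static String failedTasksColor(long failed) {
        if (failed <= 0) return "";
        double intensity = Math.min(1.0, Math.log10(failed + 1) / 3.0);
        return String.format(Locale.ROOT, "rgba(255,0,0,%.2f)", intensity);
    }

    public static void main(String[] args) {
        System.out.println(failedTasksColor(0));    // no color -> empty string
        System.out.println(failedTasksColor(10));   // pale red
        System.out.println(failedTasksColor(5000)); // fully saturated red
    }
}
```

The same log-scaled mapping would work for the blue COMPLETE TASKS shading, with a threshold for switching the cell text to white once the background gets dark.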
[jira] [Commented] (SPARK-10509) Excessive param boiler plate code
[ https://issues.apache.org/jira/browse/SPARK-10509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041937#comment-15041937 ] holdenk commented on SPARK-10509: - So I'm thinking a pretty simple option would be just having a reset owner or similar function (we could have it only work if the owner was dummy). > Excessive param boiler plate code > - > > Key: SPARK-10509 > URL: https://issues.apache.org/jira/browse/SPARK-10509 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > We have a lot of dummy params we re-set in the init code, which results in a > bunch of duplicated code. We should fix this at some point.
[jira] [Commented] (SPARK-1832) Executor UI improvement suggestions
[ https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041936#comment-15041936 ] Alexander Bozarth commented on SPARK-1832: -- I've split off a subtask for the Color improvements that I will be submitting a PR for shortly. Also, what do you mean by "the MASTER task" in the description? > Executor UI improvement suggestions > --- > > Key: SPARK-1832 > URL: https://issues.apache.org/jira/browse/SPARK-1832 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > I received some suggestions from a user for the /executors UI page to make it > more helpful. This gets more important when you have a really large number of > executors. > Fill some of the cells with color in order to make it easier to absorb > the info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed)) > - if dark blue then write the value in white (same for the RED and GREEN above) > Maybe mark the MASTER task somehow > > Report the TOTALS in each column (do this at the TOP so no need to scroll > to the bottom, or print both at top and bottom).
[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041954#comment-15041954 ] kevin yu commented on SPARK-12128: -- Hello Philip: Thanks for reporting this problem, this looks like a bug to me. I can recreate the problem also. Are you planning to fix this problem? If not, I can look into the code. Thanks. > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply two decimals in a select (either > in Scala or as SQL), and I'm assuming I must be missing the point. > The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is if you run it you will see that the addition/division and > subtraction work but the multiplication returns null. > Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > i.e.
> {code}
> +------------------+
> |(price * quantity)|
> +------------------+
> |              null|
> |              null|
> |              null|
> |              null|
> |              null|
> +------------------+
>
> +--------------------+----+--------+--------+
> |                 _c0| _c1|     _c2|     _c3|
> +--------------------+----+--------+--------+
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|    8.00|
> |1.272727272727272727|null|50.00...|    6.00|
> |                0.83|null|44.00...|-4.00...|
> |                1.00|null|58.00...|   0E-18|
> +--------------------+----+--------+--------+
> {code}
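One plausible explanation for the nulls (an assumption about the cause; it is not confirmed in this thread) is decimal precision overflow: multiplying two Decimal(38,18) values produces a result with scale 36, whose total precision can exceed Spark's 38-digit maximum, and Spark 1.5 emitted null when a decimal did not fit its declared type, while the +, - and / results still fit. The digit growth itself is easy to see with java.math.BigDecimal:

```java
import java.math.BigDecimal;

public class DecimalGrowth {
    public static void main(String[] args) {
        // Two values as a Decimal(38,18) column would store them: scale 18.
        BigDecimal price = new BigDecimal("23.000000000000000000");
        BigDecimal quantity = new BigDecimal("27.000000000000000000");

        BigDecimal product = price.multiply(quantity);
        // Multiplication adds the scales: 18 + 18 = 36 digits after the
        // point, so even a 3-digit integer part pushes total precision
        // past a 38-digit cap.
        System.out.println(product.scale());     // 36
        System.out.println(product.precision()); // 39
    }
}
```

Division, by contrast, is computed to a bounded result scale, which would be consistent with the `_c0` column above coming back populated while `_c1` is null.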
[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12089: - Fix Version/s: (was: 2.0.0) > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Blocker > Fix For: 1.6.0 > > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 
{code} > From the Spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2GB worth of data, and will likely hold even less, since any length > > 1073741824 will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting but it still seems > weird that we have a (rather low) upper limit on these buffers.
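The reporter's reading of BufferHolder.java:36 is easy to verify in isolation: "length * 2" is a 32-bit multiplication, so any length above 2^30 wraps to a negative int and the subsequent array allocation throws NegativeArraySizeException. A self-contained sketch of the failure and one common guard (plain Java for illustration, not Spark's actual fix):

```java
public class GrowOverflow {
    public static void main(String[] args) {
        int length = 1_200_000_000;  // a bit over 2^30 = 1073741824
        int doubled = length * 2;    // 32-bit wrap-around: -1894967296
        System.out.println(doubled);

        try {
            byte[] tmp = new byte[doubled]; // same pattern as BufferHolder.grow
        } catch (NegativeArraySizeException e) {
            System.out.println("caught: " + e);
        }

        // One common guard: compute the new size in 64 bits and clamp it
        // to the maximum array length before allocating.
        long safe = Math.min((long) length * 2, Integer.MAX_VALUE);
        System.out.println(safe);
    }
}
```

Even with the clamp, a JVM byte array tops out just under 2GB, so the 2GB-per-buffer ceiling the reporter observes is inherent to int-indexed arrays rather than a config setting.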