[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173413#comment-14173413 ] Mridul Muralidharan commented on SPARK-3948: Not exactly; what I was suggesting was: a) At the beginning of the transferTo method, add
{code:java}
val initialPos = channel.position()
{code}
b) At the bottom of the transferTo method, before returning size, add
{code:java}
val finalPos = channel.position()
if (finalPos == initialPos) {
  logWarning("Hit kernel bug, upgrade kernel. Attempting workaround")
  channel.position(initialPos + size)
} else {
  assert(finalPos == initialPos + size)
}
{code}
From what I understand of the javadoc, this should alleviate the problem: of course, it will need verification on the setup you have where it is currently failing! Note that the reason I would prefer this to append is simple: the method is a generic method to copy streams - and it might be used (currently, or in the future) in scenarios where append is not true. So it would be good to be defensive about the final state.

Sort-based shuffle can lead to assorted stream-corruption exceptions
Key: SPARK-3948
URL: https://issues.apache.org/jira/browse/SPARK-3948
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Reporter: Saisai Shao
Assignee: Saisai Shao

Several exceptions occurred when running TPC-DS queries against the latest master branch with sort-based shuffle enabled, such as PARSING_ERROR(2) in Snappy, deserialization errors in Kryo and offset out-of-range in FileManagedBuffer; all these exceptions are gone when we change to hash-based shuffle. With deep investigation, we found that some shuffle output files are unexpectedly smaller than the others, as the log shows:
{noformat}
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_9_11, offset: 3055635, length: 236708, file length: 47274167
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_10_11, offset: 2986484, length: 222755, file length: 47174539
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_11_11, offset: 2995341, length: 259871, file length: 383405
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_12_11, offset: 2991030, length: 268191, file length: 47478892
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_13_11, offset: 3016292, length: 230694, file length: 47420826
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_14_11, offset: 3061400, length: 241136, file length: 47395509
{noformat}
As you can see, the total file length of shuffle_6_11_11 is much smaller than the other map outputs of the same stage.
We also dumped the map outputs on the map side to see whether this small output is correct; below is the log:
{noformat}
In bypass merge sort, file name: /mnt/DP_disk1/animal/spark/spark-local-20141014182142-8345/22/shuffle_6_11_0.data, file length: 383405
length: 274722 262597 291290 272902 264941 270358 291005 295285 252482 287142 232617 259871 233734 241439 228897 234282 253834 235619 233803 255532 270739 253825 262087 266404 234273 250120 262983 257024 255947 254971 258908 247862 221613 258566 245399 251684 274843 226150 264278 245279 225656 235084 239466 212851 242245 218781 222191 215500 211548 234256 208601 204113 191923 217895 227020 215331 212313 223725 250876 256875 239276 266777 235520 237462 234063 242270 246825 255888 235937 236956 233099 264508 260303 233294 239061 254856 257475 230105 246553 260412 210355 211201 219572 206636 226866 209937 226618 218208 206255 248069 221717 222112 215734 248088 239207 246125 239056 241133 253091 246738 233128 242794 231606 255737 221123 252115 247286 229688 251087 250047 237579 263079 256251 238214 208641 201120 204009 200825 211965 200600 194492 226471 194887 226975 215072 206008 233288 222132 208860 219064 218162 237126 220465 201343 225711 232178 233786 212767 211462 213671 215853 227822 233782 214727 247001 228968 247413 222674 214241 184122 215643 207665 219079 215185 207718 212723 201613 216600 212591 208174 204195 208099 229079 230274 223373 214999 256626 228895 231821 383405 229646 220212 245495 245960 227556 213266 237203 203805 240509 239306 242365 218416 238487 219397 240026 251011 258369 255365 259811 283313 248450 264286 264562 257485 279459 249187 257609 274964 292369 273826
{noformat}
Here I dump the file name, the file length and each partition's length; obviously the sum of all partition lengths is not equal to the file length. So I think there may be a situation where partitionWriter in ExternalSorter does not always append to the end of the previously written file; the file's content is overwritten in some parts, and this leads to the exceptions I mentioned before. Also, when I changed the code of copyStream to disable transferTo and use the previous approach, all the issues were gone. So I think there may be some flushing problem in transferTo when the processed data is large.
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173427#comment-14173427 ] Saisai Shao commented on SPARK-3948: Hi [~mridulm80], thanks a lot for your suggestions, here is the snippet I changed:
{code}
val inChannel = in.asInstanceOf[FileInputStream].getChannel()
val outChannel = out.asInstanceOf[FileOutputStream].getChannel()
val initialPos = outChannel.position()
println("size = " + outChannel.size)
println("initial position = " + initialPos)
var count = 0L
val size = inChannel.size()
// In case the transferTo method transferred less data than we required.
while (count < size) {
  count += inChannel.transferTo(count, size - count, outChannel)
}
val finalPos = outChannel.position()
println("final position = " + finalPos)
if (initialPos == finalPos) {
  outChannel.position(initialPos + count)
} else {
  assert(finalPos == initialPos + count)
}
{code}
And the result shows as below:
{noformat}
size = 0 initial position = 0 final position = 0 size = 118 for i = 0
size = 118 initial position = 118 final position = 118 size = 118 for i = 1
size = 118 initial position = 236 final position = 236 size = 118 for i = 2
size = 118 initial position = 354 final position = 354 size = 118 for i = 3
size = 118 initial position = 472 final position = 472 size = 118 for i = 4
size = 118 initial position = 590 final position = 590 size = 118 for i = 5
size = 118 initial position = 708 final position = 708 size = 118 for i = 6
size = 118 initial position = 826 final position = 826 size = 118 for i = 7
size = 118 initial position = 944 final position = 944 size = 118 for i = 8
size = 118 initial position = 1062 final position = 1062 size = 118 for i = 9
{noformat}
There is still a problem on my 2.6.32 kernel machine: though the position is moving forward, the file size is still 118. But it is OK on my Ubuntu machine, so probably this is not feasible.
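For reference, the printed sequence above can be reproduced with a small driver that repeatedly copies one small file into the same output stream. The sketch below makes the setup explicit; the file names and the 118-byte input are assumptions, not part of the original test:
{code}
import java.io.{File, FileInputStream, FileOutputStream}

object TransferToRepro {
  def main(args: Array[String]): Unit = {
    val src = new File("src.bin") // hypothetical 118-byte input file
    val out = new FileOutputStream(new File("dst.bin")) // append = false, as in the report
    val outChannel = out.getChannel
    for (i <- 0 until 10) {
      val in = new FileInputStream(src)
      val inChannel = in.getChannel
      val initialPos = outChannel.position()
      val size = inChannel.size()
      var count = 0L
      // Loop in case transferTo transfers less than requested.
      while (count < size) {
        count += inChannel.transferTo(count, size - count, outChannel)
      }
      val finalPos = outChannel.position()
      if (initialPos == finalPos) {
        // Workaround under test: bump the position manually.
        outChannel.position(initialPos + count)
      }
      println(s"size = ${outChannel.size} for i = $i")
      in.close()
    }
    out.close()
  }
}
{code}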
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173428#comment-14173428 ] Davis Shepherd commented on SPARK-3882: --- This is also a serious memory leak that will cause long-running drivers for Spark Streaming jobs to exhaust their heap.

JobProgressListener gets permanently out of sync with long running job
Key: SPARK-3882
URL: https://issues.apache.org/jira/browse/SPARK-3882
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png

A long-running Spark context (non-streaming) will eventually start throwing the following in the driver:
{noformat}
java.util.NoSuchElementException: key not found: 12771
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
{noformat}
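The immediate failure above is scala.collection.mutable.HashMap.apply throwing on a stage ID that is no longer in the listener's map. Purely as an illustration of that failure mode (not the actual Spark fix), a guarded lookup avoids the exception:
{code}
import scala.collection.mutable

val stageIdToPool = mutable.HashMap[Int, String]() // hypothetical listener state

// stageIdToPool(12771) would throw java.util.NoSuchElementException here;
// a defensive lookup degrades gracefully instead of killing the listener bus.
val pool = stageIdToPool.getOrElse(12771, "default")
{code}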
[jira] [Commented] (SPARK-3807) SparkSql does not work for tables created using custom serde
[ https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173430#comment-14173430 ] Apache Spark commented on SPARK-3807: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2821

SparkSql does not work for tables created using custom serde
Key: SPARK-3807
URL: https://issues.apache.org/jira/browse/SPARK-3807
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: chirag aggarwal
Fix For: 1.1.1, 1.2.0

SparkSql crashes on selecting tables that use a custom serde. Example:
{noformat}
CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class')
STORED AS SEQUENCEFILE;
{noformat}
The following exception is seen on running a query like 'select * from table_name limit 1':
{noformat}
ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
{noformat}
After fixing this issue, when some columns in the table were referred to in the query, SparkSql could not resolve those references.
[jira] [Issue Comment Deleted] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-3882: -- Comment: was deleted (was: This is also a serious memory leak that will cause long-running drivers for Spark Streaming jobs to exhaust their heap.)
[jira] [Commented] (SPARK-3965) Spark assembly for hadoop2 contains avro-mapred for hadoop1
[ https://issues.apache.org/jira/browse/SPARK-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173445#comment-14173445 ] Apache Spark commented on SPARK-3965: - User 'dajac' has created a pull request for this issue: https://github.com/apache/spark/pull/2822

Spark assembly for hadoop2 contains avro-mapred for hadoop1
Key: SPARK-3965
URL: https://issues.apache.org/jira/browse/SPARK-3965
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 1.0.2, 1.1.0
Environment: hadoop2, HDP2.1
Reporter: David Jacot

When building the Spark assembly for hadoop2, org.apache.avro:avro-mapred for hadoop1 is picked up and added to the assembly, which leads to the following exception at runtime.
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
...
{code}
The patch for SPARK-3039 works well at compile time, but the artefact's classifier is not applied when the assembly is built. I'm not a Maven expert, but I don't think classifiers are applied to transitive dependencies.
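For comparison, sbt requires the classifier to be requested explicitly on the dependency itself; an illustrative build.sbt line (the avro-mapred version shown is an assumption) would be:
{code}
// Pulls the hadoop2 variant of avro-mapred instead of the default (hadoop1) one.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}
This makes the same pitfall visible: nothing in this mechanism propagates the classifier to dependencies pulled in transitively.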
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173451#comment-14173451 ] Reynold Xin commented on SPARK-3963: Would it make sense to support arbitrary data types also? Also, should we consider merging this with TaskMetrics?

Support getting task-scoped properties from TaskContext
Key: SPARK-3963
URL: https://issues.apache.org/jira/browse/SPARK-3963
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Wendell

This is a proposal for a minor feature. Given stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. But I'd imagine us using this for other things later on. We could also use this to expose some of the taskMetrics, such as e.g. the input bytes.
{code}
val data = sc.textFile("s3n//..2014/*/*/*.json")
data.mapPartitions { iter =>
  val tc = TaskContext.get
  val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
  val parts = fileName.split("/")
  val (year, month, day) = (parts(3), parts(4), parts(5))
  ...
}
{code}
Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
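A minimal sketch of the surface described above - a (String, String) map with a public getter and an internal setter - could look like the following; all names mirror the proposal and are not the actual Spark API:
{code}
package org.apache.spark

import scala.collection.mutable

class TaskContext {
  private val properties = mutable.HashMap[String, String]()

  // Internal: Spark (e.g. a Hadoop RDD) fills this in before the task body runs.
  private[spark] def setProperty(key: String, value: String): Unit =
    properties(key) = value

  // Public: read-only access from user code.
  def getProperty(key: String): String = properties.getOrElse(key, null)
}

object TaskContext {
  // Hypothetical standard key for the Hadoop input file name.
  val HADOOP_FILE_NAME = "spark.hadoopFileName"
}
{code}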
[jira] [Updated] (SPARK-3873) Scala style: check import ordering
[ https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3873: --- Assignee: (was: Marcelo Vanzin) Scala style: check import ordering -- Key: SPARK-3873 URL: https://issues.apache.org/jira/browse/SPARK-3873 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin
[jira] [Commented] (SPARK-3888) Limit the memory used by python worker
[ https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173454#comment-14173454 ] Apache Spark commented on SPARK-3888: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2743

Limit the memory used by python worker
Key: SPARK-3888
URL: https://issues.apache.org/jira/browse/SPARK-3888
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

Right now we do not limit the memory used by Python workers, so they may run out of memory and freeze the OS. It would be safe to have a configurable hard limit for them, which should be larger than spark.executor.python.memory.
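If such a hard limit lands behind a configuration key, setting it would presumably follow the usual SparkConf pattern; the key below is the one named in the description and is not a confirmed setting:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("python-worker-memory-demo")
  .set("spark.executor.python.memory", "512m") // assumed key, per the description above
val sc = new SparkContext(conf)
{code}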
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173489#comment-14173489 ] Mridul Muralidharan commented on SPARK-3948: Damn, this sucks: transferTo is not using the position of the channel to write output to ... while it is doing so when append is true (which effectively sets the position to the end of the file on the call to getChannel). The state of the channel, based on what we see above, is the same in both cases - since we can see the position is updated - and it is persisted and returned when we call getChannel in the next invocation of copyStreams. So there is some other set of issues at play which we might not be able to work around from the JVM. Given this, I think we should a) add a logError when initialPosition == finalPosition and inChannel.size > 0, asking users to upgrade to a newer Linux kernel, and b) of course use append = true to work around the immediate issues. (a) will ensure that developers and users/admins are notified in case other codepaths (currently or in the future) hit the same issue.
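A sketch of how (a) and (b) combine in a copyStreams-style helper, reusing the names from the snippets above; this is an illustration of the proposal, not the actual patch:
{code}
import java.io.{File, FileInputStream, FileOutputStream}

// Copy src into dst, opened with append = true (workaround (b)); still verify
// the channel position afterwards and complain loudly if the kernel bug bites (a).
def copyFileDefensively(src: File, dst: File): Long = {
  val in = new FileInputStream(src)
  val out = new FileOutputStream(dst, true)
  try {
    val inChannel = in.getChannel
    val outChannel = out.getChannel
    val initialPos = outChannel.position()
    val size = inChannel.size()
    var count = 0L
    while (count < size) {
      count += inChannel.transferTo(count, size - count, outChannel)
    }
    val finalPos = outChannel.position()
    if (finalPos == initialPos && size > 0) {
      // (a): position did not advance although bytes were written - a known
      // kernel bug; log an error asking users to upgrade, then reposition
      // manually as a best-effort workaround.
      System.err.println("transferTo did not advance the channel position; " +
        "please upgrade your Linux kernel")
      outChannel.position(initialPos + count)
    }
    count
  } finally {
    in.close()
    out.close()
  }
}
{code}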
[jira] [Comment Edited] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173489#comment-14173489 ] Mridul Muralidharan edited comment on SPARK-3948 at 10/16/14 7:39 AM: -- Damn, this sucks: transferTo is not using the position of the channel to write output to ... while it is doing so when append is true (which effectively sets the position to the end of the file on the call to getChannel). The state of the channel, based on what we see above, is the same in both cases - since we can see the position is updated - and it is persisted and returned when we call getChannel in the next invocation of copyStreams. Not sure if sync() will help ... So there is some other set of issues at play which we might not be able to work around from the JVM. Given this, I think we should a) add a logError when initialPosition == finalPosition and inChannel.size > 0, asking users to upgrade to a newer Linux kernel, and b) of course use append = true to work around the immediate issues. (a) will ensure that developers and users/admins are notified in case other codepaths (currently or in the future) hit the same issue.
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173526#comment-14173526 ] Ravindra Pesala commented on SPARK-3814: Added support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in this PR, and I updated the title of this defect.

Bitwise & does not work in Hive
Key: SPARK-3814
URL: https://issues.apache.org/jira/browse/SPARK-3814
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Assignee: Ravindra Pesala
Priority: Minor

Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2
{noformat}
TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = & TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2
{noformat}
SQLState: null ErrorCode: 0
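Once the operators are supported, they should behave like any other HiveQL expression; an illustrative use from Scala (table and column names are hypothetical, and an existing HiveContext is assumed):
{code}
// Exercises & (AND), | (OR), ^ (XOR) and ~ (NOT) on integer columns a and b.
val rows = hiveContext.sql("SELECT a & b, a | b, a ^ b, ~a FROM some_table")
rows.collect().foreach(println)
{code}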
[jira] [Updated] (SPARK-3814) Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-3814: --- Summary: Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL (was: Bitwise & does not work in Hive)
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173528#comment-14173528 ] Saisai Shao commented on SPARK-3948: Hi [~mridulm80], thanks a lot for your suggestion. I think the other code in Spark that currently uses copyStream will not be affected by this issue, since it only copies one input file to the output file. But some use cases like ExternalSorter will indeed be affected by it. I will submit a PR according to your suggestions, thanks a lot.
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173541#comment-14173541 ] qiaohaijun commented on SPARK-2706: --- sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Dhive.version=0.13.1 I get the same error output, and I ran git pull today to get the updated code.

Enable Spark to support Hive 0.13
Key: SPARK-2706
URL: https://issues.apache.org/jira/browse/SPARK-2706
Project: Spark
Issue Type: Dependency upgrade
Components: SQL
Affects Versions: 1.0.1
Reporter: Chunjun Xiao
Assignee: Zhan Zhang
Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff

It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error:
{quote}
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch;
 found : String
 required: Array[String]
[ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]], x$2: Class[_], x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc <and> ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR] val tableDesc = new TableDesc(
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition
[ERROR] val partPath = partition.getPartitionPath
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils
[ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name))
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR] new HiveDecimal(bd.underlying())
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch;
 found : org.apache.hadoop.fs.Path
 required: String
[ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context
[ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying())
[ERROR] ^
[ERROR] 8 errors found
[DEBUG] Compilation failed (CompilerInterface)
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [2.579s]
[INFO] Spark Project Core SUCCESS [2:39.805s]
[INFO] Spark Project Bagel ... SUCCESS [21.148s]
[INFO] Spark Project GraphX .. SUCCESS [59.950s]
[INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
[INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
[INFO] Spark Project Tools ... SUCCESS [15.405s]
[INFO] Spark Project Catalyst SUCCESS [1:17.405s]
[INFO] Spark Project SQL . SUCCESS [1:11.094s]
[INFO] Spark Project Hive FAILURE [11.121s]
[INFO] Spark Project REPL
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173542#comment-14173542 ] Praveen Seluka commented on SPARK-3174: --- [~vanzin] Regarding your question related to the scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the remaining runtime of the stage using the following heuristic (sketched in code after this entry): var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)). We also add (activeTasks/2) as we need to take the currently running tasks into account. - I think the proposal I have made is not very different from the exponential approach. Let's say we start the Spark application with just 2 executors. It will double the number of executors and hence go to 4, 8 and so on. The difference I see here compared to the exponential approach is that we start doubling the current count of executors, whereas exponential starts from 1 (and also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as the exponential approach too. - In a way, this proposal adds an ETA-based heuristic in addition (a threshold for stage completion time). - Also, this proposal adds the memory-used heuristic for scaling decisions, which is missing in Andrew's PR (correct me if I am wrong here). This for sure will be very useful.

Provide elastic scaling within a Spark application
Key: SPARK-3174
URL: https://issues.apache.org/jira/browse/SPARK-3174
Project: Spark
Issue Type: Improvement
Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf

A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that
* Request more executors when many pending tasks are building up
* Discard executors when they are idle
See the latest design doc for more information.
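As a sketch, the stage-runtime estimate described in the comment above is just the following function; the names are hypothetical, not actual fields of the proposal's implementation:
{code}
// Estimated remaining runtime of a stage: average observed task time,
// scaled by the remaining tasks plus half of the currently running ones
// (running tasks are assumed to be, on average, half done).
def estimatedRuntimeForStage(averageTaskRuntimeMs: Double,
                             remainingTasks: Int,
                             activeTasks: Int): Double =
  averageTaskRuntimeMs * (remainingTasks + activeTasks / 2.0)
{code}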
[jira] [Commented] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed
[ https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173543#comment-14173543 ] qiaohaijun commented on SPARK-1479: --- sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Dhive.version=0.13.1 I get the same error output.

building spark on 2.0.0-cdh4.4.0 failed
Key: SPARK-1479
URL: https://issues.apache.org/jira/browse/SPARK-1479
Project: Spark
Issue Type: Question
Environment: 2.0.0-cdh4.4.0
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
spark 0.9.1
java version 1.6.0_32
Reporter: jackielihf
Attachments: mvn.log

{noformat}
[INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: Compilation failed
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101)
at sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70)
at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88)
at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:24)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:22)
at sbt.inc.Incremental$.cycle(Incremental.scala:40)
at sbt.inc.Incremental$.compile(Incremental.scala:25)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:20)
at
{noformat}
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173542#comment-14173542 ] Praveen Seluka edited comment on SPARK-3174 at 10/16/14 8:55 AM: - [~vanzin] Regarding your question related to the scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the remaining runtime of the stage using the heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)). We also add (activeTasks/2) as we need to take the currently running tasks into account. - I think the proposal I have made is not very different from the exponential approach. Let's say we start the Spark application with just 2 executors. It will double the number of executors and hence go to 4, 8 and so on. The difference I see here compared to the exponential approach is that we start doubling the current count of executors, whereas exponential starts from 1 (and also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as the exponential approach too. - In a way, this proposal adds an ETA-based heuristic in addition (a threshold for stage completion time). - Also, this proposal adds the memory-used heuristic for scaling decisions, which is missing in Andrew's PR (correct me if I am wrong here). This for sure will be very useful. - The main point being, it does all this without making any changes in the TaskSchedulerImpl/TaskSetManager code base.
It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173542#comment-14173542 ] Praveen Seluka edited comment on SPARK-3174 at 10/16/14 8:56 AM: - [~vanzin] Regarding your question related to scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the runtime(remaining) of the stage using the following heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)) We also add (activeTasks/2) as we need to take the current running tasks into account. - I think, the proposal I have made is not very different from the exponential approach. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. The difference I see here comparing to exponential approach is, we start doubling the current count of executors whereas exponential starts from 1 (also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as exponential approach also. - In a way, this proposal adds some ETA based heuristic in addition. (threshold for stage completion time) - Also, this proposal adds the memory used heuristic too for scaling decisions which is missing in Andrew's PR. (Correct me if am wrong here). This for sure will be very useful. - The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager code base. And I think, thats a really nice thing. was (Author: praveenseluka): [~vanzin] Regarding your question related to scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the runtime(remaining) of the stage using the following heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)) We also add (activeTasks/2) as we need to take the current running tasks into account. - I think, the proposal I have made is not very different from the exponential approach. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. The difference I see here comparing to exponential approach is, we start doubling the current count of executors whereas exponential starts from 1 (also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as exponential approach also. - In a way, this proposal adds some ETA based heuristic in addition. (threshold for stage completion time) - Also, this proposal adds the memory used heuristic too for scaling decisions which is missing in Andrew's PR. (Correct me if am wrong here). This for sure will be very useful. - The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager code base. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. 
We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
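The heuristic above is simple enough to state as code. A minimal sketch, assuming the caller tracks the average task runtime and the remaining/active task counts per stage (names follow the comment; this is not the actual proposed patch):
{code}
// Sketch of the stage-ETA heuristic: running tasks are assumed to be
// roughly half finished, so each contributes half a task's runtime.
def estimatedRuntimeForStage(
    averageRuntime: Double,
    remainingTasks: Int,
    activeTasks: Int): Double = {
  averageRuntime * (remainingTasks + activeTasks / 2.0)
}

// Hypothetical usage: ask for more executors when the estimate exceeds a
// configured stage-completion-time threshold.
// if (estimatedRuntimeForStage(avg, remaining, active) > thresholdMs) scaleUp()
{code}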
[jira] [Created] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
Christophe PRÉAUD created SPARK-3967: Summary: Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-3967: - Attachment: spark-1.1.0-yarn_cluster_tmpdir.patch Ensure that the temporary file which the jar file is fetched into is located in the same directory as the target jar file Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173566#comment-14173566 ] Christophe PRÉAUD commented on SPARK-3967: -- After investigating, it turns out that the problem occurs when the executor fetches a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local/nm-local-dir (the first directory of yarn.nodemanager.local-dirs), and then moved to one of the directories of yarn.nodemanager.local-dirs:
-- if it is the same as the temporary file's directory (i.e. /d1/yarn/local/nm-local-dir), then the application continues normally
-- if it is another one (i.e. /d2/yarn/local/nm-local-dir, /d3/yarn/local/nm-local-dir, ...), it fails with the following error:
{noformat}
14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
at com.google.common.io.Files.copy(Files.java:436)
at com.google.common.io.Files.move(Files.java:651)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
I have no idea why the move fails when the source and target files are not on the same partition (it is no longer atomic, but it should succeed anyway); for the moment I have worked around the problem with the attached patch (i.e. I ensure that the temp file and the moved file are always on the same partition). Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
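The attached patch's idea can be summarized in a short sketch: create the temporary download file in the same directory as the target file, so the final move never crosses a partition boundary. Here {{downloadTo}} is a hypothetical helper and the parameters are assumed inputs, not Spark's actual code:
{code}
import java.io.File
import com.google.common.io.Files

// Create the temp file next to the target so Files.move() degenerates to a
// same-partition rename instead of a copy across partitions.
def fetchToSamePartition(url: String, targetDir: File, filename: String): File = {
  val targetFile = new File(targetDir, filename)
  val tempFile = File.createTempFile("fetchFileTemp", null, targetDir)
  downloadTo(url, tempFile) // hypothetical helper that streams the jar into tempFile
  Files.move(tempFile, targetFile)
  targetFile
}
{code}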
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173618#comment-14173618 ] Sean Owen commented on SPARK-3431: -- Yes, that should be what scalatest does. It is a fork of an old surefire, so it only has very few options. This parallelization failed as above for a few reasons. I have not gotten surefire to run the Scala tests. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173619#comment-14173619 ] Dev Lakhani commented on SPARK-2321: There are some issues and bugs under the webui component that are active. Should we incorporate these into this Jira or is it best to work on them separately and then merge these (2321) changes later? https://issues.apache.org/jira/browse/SPARK/component/12322616 Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
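For context, a minimal sketch of how progress information is consumed through the current listener API, illustrating the job-to-stage bookkeeping that point 2 of the description complains about (event field names as in the current DeveloperApi; treat this as illustrative):
{code}
import org.apache.spark.scheduler._

class ProgressListener extends SparkListener {
  // The listener has to build its own job -> stages mapping, because the
  // events themselves do not connect jobs with stages.
  private val stagesByJob = collection.mutable.Map[Int, Seq[Int]]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    stagesByJob(jobStart.jobId) = jobStart.stageIds
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    // stageCompleted.stageInfo says a stage finished, but relating it back
    // to a job (or job group) relies entirely on the table built above.
  }
}
{code}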
[jira] [Created] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause
Yash Datta created SPARK-3968: - Summary: Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause Key: SPARK-3968 URL: https://issues.apache.org/jira/browse/SPARK-3968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.1.1 The parquet-mr project has introduced a new filter api, along with several fixes, like filtering on OPTIONAL columns as well. It can also eliminate entire RowGroups based on certain statistics like min/max. We can leverage that to further improve the performance of queries with filters. The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-3968: -- Shepherd: Yash Datta Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause --- Key: SPARK-3968 URL: https://issues.apache.org/jira/browse/SPARK-3968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.1.1 The parquet-mr project has introduced a new filter api, along with several fixes, like filtering on OPTIONAL columns as well. It can also eliminate entire RowGroups based on certain statistics like min/max. We can leverage that to further improve the performance of queries with filters. The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
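A rough sketch of what such a custom filter could look like on top of filter2's UserDefinedPredicate (class and method names follow parquet-mr's filter2 package; the eventual Spark patch may well differ):
{code}
import parquet.filter2.predicate.{Statistics, UserDefinedPredicate}

// Keep only records whose value is in the given set; evaluated inside the
// ParquetRecordReader so non-matching records never reach Spark.
class SetInFilter[T <: Comparable[T]](valueSet: Set[T])
  extends UserDefinedPredicate[T] with Serializable {

  override def keep(value: T): Boolean =
    value != null && valueSet.contains(value)

  // min/max statistics alone cannot prove that a row group contains no
  // value from an arbitrary set, so never drop chunks on that basis.
  override def canDrop(statistics: Statistics[T]): Boolean = false

  override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
}
{code}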
[jira] [Commented] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173731#comment-14173731 ] ssj commented on SPARK-3629: Need someone to verify this patch. Improvements to YARN doc Key: SPARK-3629 URL: https://issues.apache.org/jira/browse/SPARK-3629 Project: Spark Issue Type: Documentation Components: Documentation, YARN Reporter: Matei Zaharia Labels: starter Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. In addition, the doc mentions yarn-cluster vs yarn-client as separate masters, which is inconsistent with the help output from spark-submit (which says to always use yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173757#comment-14173757 ] Apache Spark commented on SPARK-3948: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2824 Sort-based shuffle can lead to assorted stream-corruption exceptions Key: SPARK-3948 URL: https://issues.apache.org/jira/browse/SPARK-3948 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Saisai Shao Assignee: Saisai Shao Several exceptions occurred when running TPC-DS queries against latest master branch with sort-based shuffle enable, like PARSING_ERROR(2) in snappy, deserializing error in Kryo and offset out-range in FileManagedBuffer, all these exceptions are gone when we changed to hash-based shuffle. With deep investigation, we found that some shuffle output file is unexpectedly smaller than the others, as the log shows: {noformat} 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_9_11, offset: 3055635, length: 236708, file length: 47274167 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_10_11, offset: 2986484, length: 222755, file length: 47174539 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_11_11, offset: 2995341, length: 259871, file length: 383405 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_12_11, offset: 2991030, length: 268191, file length: 47478892 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_13_11, offset: 3016292, length: 230694, file length: 47420826 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_14_11, offset: 3061400, length: 241136, file length: 47395509 {noformat} As you can see the total file length of shuffle_6_11_11 is much smaller than other same stage map output results. 
And we also dumped the map outputs on the map side to see whether this small output is correct or not; below is the log: {noformat} In bypass merge sort, file name: /mnt/DP_disk1/animal/spark/spark-local-20141014182142-8345/22/shuffle_6_11_0.data, file length: 383405
length: 274722 262597 291290 272902 264941 270358 291005 295285 252482 287142 232617 259871 233734 241439 228897 234282 253834 235619 233803 255532 270739 253825 262087 266404 234273 250120 262983 257024 255947 254971 258908 247862 221613 258566 245399 251684 274843 226150 264278 245279 225656 235084 239466 212851 242245 218781 222191 215500 211548 234256 208601 204113 191923 217895 227020 215331 212313 223725 250876 256875 239276 266777 235520 237462 234063 242270 246825 255888 235937 236956 233099 264508 260303 233294 239061 254856 257475 230105 246553 260412 210355 211201 219572 206636 226866 209937 226618 218208 206255 248069 221717 222112 215734 248088 239207 246125 239056 241133 253091 246738 233128 242794 231606 255737 221123 252115 247286 229688 251087 250047 237579 263079 256251 238214 208641 201120 204009 200825 211965 200600 194492 226471 194887 226975 215072 206008 233288 222132 208860 219064 218162 237126 220465 201343 225711 232178 233786 212767 211462 213671 215853 227822 233782 214727 247001 228968 247413 222674 214241 184122 215643 207665 219079 215185 207718 212723 201613 216600 212591 208174 204195 208099 229079 230274 223373 214999 256626 228895 231821 383405 229646 220212 245495 245960 227556 213266 237203 203805 240509 239306 242365 218416 238487 219397 240026 251011 258369 255365 259811 283313 248450 264286 264562 257485 279459 249187 257609 274964 292369 273826 {noformat} Here I dump the file name, length and each partition's length; obviously the sum of all partition lengths is not equal to the file length. So I think there may be a situation where partitionWriter in ExternalSorter does not always append to the end of the previously written file: the file's content is overwritten in some parts, and this leads to the exceptions I mentioned before. Also, I changed the code of copyStream to disable transferTo and use the previous implementation, and all the issues are gone. So I think there may be some flushing problem in transferTo when the processed data is large. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
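The non-transferTo fallback mentioned above is just a plain buffered copy. A minimal sketch of that shape (illustrative, not the exact previous implementation of copyStream):
{code}
import java.io.{InputStream, OutputStream}

// Copy everything from in to out through a heap buffer, avoiding
// FileChannel.transferTo and the flushing behavior suspected above.
def copyStreamFallback(in: InputStream, out: OutputStream): Long = {
  val buf = new Array[Byte](8192)
  var count = 0L
  var n = in.read(buf)
  while (n != -1) {
    out.write(buf, 0, n)
    count += n
    n = in.read(buf)
  }
  count // total bytes copied
}
{code}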
[jira] [Commented] (SPARK-3907) add truncate table support
[ https://issues.apache.org/jira/browse/SPARK-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173806#comment-14173806 ] Apache Spark commented on SPARK-3907: - User 'wangxiaojing' has created a pull request for this issue: https://github.com/apache/spark/pull/2770 add truncate table support - Key: SPARK-3907 URL: https://issues.apache.org/jira/browse/SPARK-3907 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: features Fix For: 1.1.0 Original Estimate: 0.05h Remaining Estimate: 0.05h The truncate table syntax had been disabled. It removes all rows from a table or partition(s). Currently the target table should be a native/managed table or an exception will be thrown. The user can specify a partial partition_spec to truncate multiple partitions at once, and omitting partition_spec will truncate all partitions in the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
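Since the syntax mirrors Hive's TRUNCATE TABLE, usage from Spark SQL's Hive support would presumably look like the following sketch (table and partition values are made up; assumes a HiveContext named hiveContext):
{code}
// Remove all rows from every partition of a managed table.
hiveContext.sql("TRUNCATE TABLE page_views")

// Remove rows only from the partitions matching a partial partition_spec.
hiveContext.sql("TRUNCATE TABLE page_views PARTITION (ds='2014-10-16')")
{code}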
[jira] [Commented] (SPARK-3969) Optimizer should have a super class as an interface.
[ https://issues.apache.org/jira/browse/SPARK-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173854#comment-14173854 ] Apache Spark commented on SPARK-3969: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2825 Optimizer should have a super class as an interface. Key: SPARK-3969 URL: https://issues.apache.org/jira/browse/SPARK-3969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Some developers want to replace {{Optimizer}} to fit their projects but can't do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3970) Remove duplicate removal of local dirs
Liang-Chi Hsieh created SPARK-3970: -- Summary: Remove duplicate removal of local dirs Key: SPARK-3970 URL: https://issues.apache.org/jira/browse/SPARK-3970 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh The shutdown hook of DiskBlockManager already removes localDirs, so there is no need to also register them with Utils.registerShutdownDeleteDir. Registering them causes duplicate removal of these local dirs and corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
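A small sketch of the redundancy being removed (Utils.registerShutdownDeleteDir is Spark's; the surrounding code is illustrative):
{code}
// DiskBlockManager already installs its own shutdown hook that deletes
// localDirs, so registering them again makes shutdown try to delete the
// same directories twice, producing the reported exceptions.
localDirs.foreach { dir =>
  Utils.registerShutdownDeleteDir(dir) // redundant registration
}
{code}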
[jira] [Created] (SPARK-3969) Optimizer should have a super class as an interface.
Takuya Ueshin created SPARK-3969: Summary: Optimizer should have a super class as an interface. Key: SPARK-3969 URL: https://issues.apache.org/jira/browse/SPARK-3969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Some developers want to replace {{Optimizer}} to fit their projects but can't do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173856#comment-14173856 ] Apache Spark commented on SPARK-3970: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/2826 Remove duplicate removal of local dirs -- Key: SPARK-3970 URL: https://issues.apache.org/jira/browse/SPARK-3970 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh The shutdown hook of DiskBlockManager already removes localDirs, so there is no need to also register them with Utils.registerShutdownDeleteDir. Registering them causes duplicate removal of these local dirs and corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173904#comment-14173904 ] Akshat Aranya commented on SPARK-2365: -- This looks great! I have been using IndexedRDD for a while, to great effect. I have one suggestion: it would be nice to override setName() in IndexedRDDLike
{code}
override def setName(_name: String): this.type = {
  partitionsRDD.setName(_name)
  this
}
{code}
so that the IndexedRDD shows up with friendly names in the storage UI, just like regular, cached RDDs do. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
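With the override in place, usage would presumably be the same as for any other RDD; a one-line sketch, assuming an IndexedRDD value named vertices:
{code}
// The friendly name now propagates to partitionsRDD and shows up in the
// storage tab, just like for a regular cached RDD.
vertices.setName("vertex-index").cache()
{code}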
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173926#comment-14173926 ] Dev Lakhani commented on SPARK-3957: Hi all, here are my thoughts on a possible approach. The broadcast occurs from the SparkContext to the BroadcastManager and its newBroadcast method. In the first instance, the broadcasted data is stored in the Block Manager (see HttpBroadcast) of the executor. Any tracking of broadcast variables must be referenced by the BlockManagerSlaveActor and BlockManagerMasterActor. In particular, UpdateBlockInfo and RemoveBroadcast should update the total memory used in blocks when blocks are added and removed. These can then be hooked up to the UI using a new page like ExecutorsPage and by defining new methods in the relevant listener, such as StorageStatusListener. These are my initial thoughts as someone new to these components; any other ideas or approaches? Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173951#comment-14173951 ] Marcelo Vanzin commented on SPARK-3174: --- bq. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. Well, I'd say it's unusual for applications to start with a low number of executors, especially if the user knows it will be executing things right away. So if I start it with 32 executors, your code will right away try to make it 64. Andrew's approach would try to make it 33, then 35, then... But I agree that it might be a good idea to make the auto-scaling backend an interface, so that we can easily play with different approaches. That shouldn't be hard at all. bq. The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager Theoretically, I agree that's a good thing. I haven't gone through the code in detail, though, to know whether all the information Andrew is using from the scheduler is available from SparkListener events. If you can derive that info, great, I think it would be worth it to make the auto-scale code decoupled from the scheduler. If not, then we either have the choice of hooking the auto-scaling backend into the scheduler (like Andrew's change) or exposing more info in the events - which may or may not be a good thing, depending on what that info is. Anyway, as I've said, both approaches are not irreconcilably different - they're actually more similar than not. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
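To make the contrast concrete, a sketch of the two growth policies being compared (illustrative only, using the executor counts from the comment above):
{code}
// Doubling from the current count: 32 -> 64 -> 128 -> ...
def doubling(current: Int): Int = current * 2

// Exponential increments starting from 1 executor per round:
// 32 -> 33 -> 35 -> 39 -> ... (round = consecutive rounds with a backlog)
def exponentialAdd(current: Int, round: Int): Int = current + (1 << round)
{code}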
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173954#comment-14173954 ] Shivaram Venkataraman commented on SPARK-3957: -- I think it needs to be tracked in the Block Manager -- However we also need to track this on a per-executor basis and not just at the driver. Right now AFAIK, executors do not report new broadcast blocks to the master to reduce communication. However we could add broadcast blocks to some periodic report. [~andrewor] might know more. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174005#comment-14174005 ] Marcelo Vanzin commented on SPARK-3883: --- FYI, any PR here should make sure the default configuration is safe against the POODLE attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack Provide SSL support for Akka and HttpServer based connections - Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jacek Lewandowski Spark uses at least 4 logical communication channels: 1. Control messages - Akka based 2. JARs and other files - Jetty based (HttpServer) 3. Computation results - Java NIO based 4. Web UI - Jetty based The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
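On the Jetty side, the linked workaround boils down to excluding SSLv3 on the SslContextFactory. A sketch (method names per Jetty's SslContextFactory; exact availability depends on the Jetty version Spark bundles, so treat this as an assumption):
{code}
import org.eclipse.jetty.util.ssl.SslContextFactory

val sslFactory = new SslContextFactory(keystorePath) // keystorePath: assumed input
sslFactory.setKeyStorePassword(keystorePassword)     // keystorePassword: assumed input
// Never negotiate SSLv3, which is what POODLE exploits.
sslFactory.addExcludeProtocols("SSLv3")
{code}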
[jira] [Commented] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174003#comment-14174003 ] Marcelo Vanzin commented on SPARK-2750: --- FYI, any PR here should make sure the default configuration is safe against the POODLE attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Fix For: 1.0.3 Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for the web ui using Jetty ssl integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. The user can switch between https and http by configuring spark.http.policy in a JVM property for each process, with http chosen by default. 2. The web port of Master and Worker is decided in order of launch arguments, JVM property, system env and default port. 3. Below are some other configuration items:
spark.ssl.server.keystore.location - The file or URL of the SSL key store
spark.ssl.server.keystore.password - The password for the key store
spark.ssl.server.keystore.keypassword - The password (if any) for the specific key within the key store
spark.ssl.server.keystore.type - The type of the key store (default JKS)
spark.client.https.need-auth - True if SSL needs client authentication
spark.ssl.server.truststore.location - The file name or URL of the trust store location
spark.ssl.server.truststore.password - The password for the trust store
spark.ssl.server.truststore.type - The type of the trust store (default JKS)
Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174019#comment-14174019 ] Andrew Or commented on SPARK-3957: -- Yeah my understanding is that broadcast blocks aren't reported to the driver (and it makes sense to not report them because the driver is the one who initiated the broadcast in the first place). The source of the broadcast info we want to display is in the BlockManager of each executor, and we need to get this to the driver somehow. We could add some periodic reporting but that opens another channel between the driver and the executors. There is an ongoing effort to do something similar for task metrics https://github.com/apache/spark/pull/2087, so maybe we can piggyback this information on the heartbeats there. Also I believe this is a duplicate of an old issue SPARK-1761, though this one contains more information so let's keep this one open. I will close the other one in favor of this. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
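A rough sketch of the piggybacking idea: attach a per-executor broadcast memory total to the existing heartbeat message (the extra field is hypothetical, not Spark's current Heartbeat):
{code}
import org.apache.spark.executor.TaskMetrics
import org.apache.spark.storage.BlockManagerId

// Today's heartbeat carries roughly (executorId, taskMetrics, blockManagerId);
// the last field below is the hypothetical addition, summed on the executor
// from its BlockManager's broadcast blocks.
case class Heartbeat(
    executorId: String,
    taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId,
    broadcastBytesUsed: Long)
{code}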
[jira] [Closed] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1761. Resolution: Duplicate Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174024#comment-14174024 ] Andrew Or commented on SPARK-3957: -- Hey [~devl.development] are you planning to work on this? Or is [~CodingCat]? The latter is currently assigned but maybe you guys should work it out. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174026#comment-14174026 ] Andrew Or commented on SPARK-1761: -- Closing in favor of SPARK-3957, which is more descriptive. Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3971) Failed to deserialize Vector in cluster mode
Davies Liu created SPARK-3971: - Summary: Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called in the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2585) Remove special handling of Hadoop JobConf
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2585. --- Resolution: Fixed Due to the CONFIGURATION_INSTANTIATION_LOCK thread-safety issue, I think that we'll still end up having to serialize the Configuration separately. If we didn't, then we'd have to hold CONFIGURATION_INSTANTIATION_LOCK while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long). Therefore, I'm closing this as Won't Fix. The thread-safety issues with Configuration will be addressed by a separate clone() patch. Remove special handling of Hadoop JobConf - Key: SPARK-2585 URL: https://issues.apache.org/jira/browse/SPARK-2585 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Josh Rosen Priority: Critical This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the implementation does not use shared conf objects). We no longer need to specially broadcast the Hadoop configuration since we are broadcasting RDD data anyways. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
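The pattern under discussion can be sketched as follows: keep the Configuration out of the task closure, serialize it separately, and take the global lock only around deserialization (names are illustrative; the lock object stands in for Spark's CONFIGURATION_INSTANTIATION_LOCK):
{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration is a Hadoop Writable
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // Instantiating and loading a Configuration is not thread-safe, so
    // guard deserialization with a global lock.
    SerializableConfiguration.lock.synchronized {
      value = new Configuration(false)
      value.readFields(in)
    }
  }
}

object SerializableConfiguration {
  val lock = new Object()
}
{code}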
[jira] [Created] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
Michael Griffiths created SPARK-3972: Summary: PySpark Error on Windows with sc.wholeTextFiles Key: SPARK-3972 URL: https://issues.apache.org/jira/browse/SPARK-3972 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, Windows Affects Versions: 1.1.0 Environment: Windows 8.1 x64 Java SE Version 8 Update 20 (build 1.8.0_20-b26); Python 2.7.7 Reporter: Michael Griffiths Priority: Minor When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even though I can read the text file(s) individually with sc.textFile(). Steps followed: 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz]) 2) Extract into folder at root of drive: **D:\spark** 3) Create test folder at **D:\testdata** with one (HTML) file contained within it. 4) Launch PySpark at **bin\PySpark** 5) Try to use sc.wholeTextFiles('d:/testdata'); fail. Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine. See session (below) with tracebacks from errors.
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884")
>>> file.count()
732
>>> file.first()
u'<!DOCTYPE html>'
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
    return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174049#comment-14174049 ] Apache Spark commented on SPARK-3736: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/2828 Workers should reconnect to Master if disconnected -- Key: SPARK-3736 URL: https://issues.apache.org/jira/browse/SPARK-3736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Ash Assignee: Matthew Cheah Priority: Critical In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master. The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec). This has been observed by: - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174053#comment-14174053 ] Nan Zhu commented on SPARK-3957: I agree with [~andrewor14]; I was also thinking about piggybacking the information on the heartbeat between the heartbeatReceiver and the executor... Not sure about the current Hadoop implementation, but in the 1.x version, TaskStatus was piggybacked on the heartbeat between the TaskTracker and the JobTracker; to me, it's a very natural way to do this. I accepted it this morning and have started some work, so, [~devlakhani], please let me finish this, thanks. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3973) Print callSite information for broadcast variables
Shivaram Venkataraman created SPARK-3973: Summary: Print callSite information for broadcast variables Key: SPARK-3973 URL: https://issues.apache.org/jira/browse/SPARK-3973 Project: Spark Issue Type: Bug Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Minor Fix For: 1.2.0 Printing call site information for broadcast variables will help in debugging which variables are used, when they are used etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174066#comment-14174066 ] Apache Spark commented on SPARK-3973: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/2829 Print callSite information for broadcast variables -- Key: SPARK-3973 URL: https://issues.apache.org/jira/browse/SPARK-3973 Project: Spark Issue Type: Bug Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Minor Fix For: 1.2.0 Printing call site information for broadcast variables will help in debugging which variables are used, when they are used etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174073#comment-14174073 ] Zhan Zhang commented on SPARK-2706: --- The code does not go to upstream yet. To build 0.13.1 support, you need to apply the patch. But now it uses the customized package with org.spark-project, which seems to be withdraw from published. So to use it, you need to change org.spark-project to original hive package. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Assignee: Zhan Zhang Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal = new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent 
POM .. SUCCESS [2.579s]
[INFO] Spark Project Core SUCCESS [2:39.805s]
[INFO] Spark Project Bagel ... SUCCESS [21.148s]
[INFO] Spark Project GraphX .. SUCCESS [59.950s]
[INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
[INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
[INFO] Spark Project Tools ... SUCCESS [15.405s]
[INFO] Spark Project Catalyst SUCCESS [1:17.405s]
[INFO] Spark Project SQL . SUCCESS [1:11.094s]
[INFO] Spark Project Hive
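To make the package swap Zhan Zhang describes concrete, here is a minimal sketch of what it might look like in an sbt build; the group/artifact IDs and versions are illustrative assumptions, not the actual patch:
{code}
// Hypothetical build.sbt fragment: drop the customized org.spark-project
// Hive artifact and depend on the upstream Apache Hive package instead.
// Coordinates and versions here are assumptions for illustration only.
libraryDependencies -= "org.spark-project.hive" % "hive-exec" % "0.12.0"
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1"
{code}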
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174107#comment-14174107 ] Dev Lakhani commented on SPARK-3957: Hi, for now I am happy for [~CodingCat] to take this on; maybe once there are some commits I can help with the UI side, but until then I'll hold back. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables is not reflected in the memory usage reported in the WebUI. For example, the Executors tab shows memory used in each executor, but this number doesn't include memory used by broadcast variables. Similarly, the Storage tab only shows the list of cached RDDs and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174110#comment-14174110 ] Michael Griffiths commented on SPARK-3972: -- This issue does NOT occur if I build Spark from source using Bash and sbt\sbt assembly; it's restricted to the pre-compiled version. PySpark Error on Windows with sc.wholeTextFiles --- Key: SPARK-3972 URL: https://issues.apache.org/jira/browse/SPARK-3972 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, Windows Affects Versions: 1.1.0 Environment: Windows 8.1 x64 Java SE Version 8 Update 20 (build 1.8.0_20-b26); Python 2.7.7 Reporter: Michael Griffiths Priority: Minor When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even though I can read the text file(s) individually with sc.textFile(). Steps followed: 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz]) 2) Extract into folder at root of drive: **D:\spark** 3) Create test folder at **D:\testdata** with one (HTML) file contained within it. 4) Launch PySpark at **bin\PySpark** 5) Try to use sc.wholeTextFiles('d:/testdata'); it fails. Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine. See the session (below) with tracebacks from errors. {noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile('d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884')
>>> file.count()
732
>>> file.first()
u'<!DOCTYPE html>'
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
    return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at
[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3882: --- Description: A long running spark context (non-streaming) will eventually start throwing the following in the driver: {code}
java.util.NoSuchElementException: key not found: 12771
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
{code} And the UI will show running jobs that are in fact no longer running and never clean them up (see attached screenshot). The result is that the UI becomes unusable, and the JobProgressListener leaks memory as the list of running jobs continues to grow. was: A long running spark context (non-streaming) will eventually start throwing the following in the driver:
[jira] [Created] (SPARK-3974) Block matrix abstractions and partitioners
Reza Zadeh created SPARK-3974: - Summary: Block matrix abstractions and partitioners Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
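As a rough illustration of what such an abstraction could look like, here is a sketch assuming fixed square blocks stored in an RDD keyed by block coordinate; all names (DenseBlock, BlockMatrix, GridPartitioner) are hypothetical, not the eventual MLlib API:
{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical sketch only. A dense block addressed by its
// (blockRow, blockCol) coordinate; blockSize is fixed for the whole matrix.
case class DenseBlock(values: Array[Double])

class BlockMatrix(val blocks: RDD[((Int, Int), DenseBlock)],
                  val numBlockRows: Int,
                  val numBlockCols: Int,
                  val blockSize: Int)

// A row-major grid partitioner over block coordinates, so that blocks
// combined together tend to land in the same partition.
class GridPartitioner(numBlockRows: Int, numBlockCols: Int,
                      override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (blockRow: Int, blockCol: Int) =>
      // Linearize the grid coordinate, then spread it over the partitions.
      math.abs((blockRow * numBlockCols + blockCol) % numPartitions)
    case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
  }
}
{code}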
[jira] [Created] (SPARK-3975) Block Matrix addition and multiplication
Reza Zadeh created SPARK-3975: - Summary: Block Matrix addition and multiplication Key: SPARK-3975 URL: https://issues.apache.org/jira/browse/SPARK-3975 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Block matrix addition and multiplication, for the case when partitioning schemes match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
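One way the matching-partitioner case could be handled, continuing the hypothetical block representation sketched under SPARK-3974 above (an illustration only, not the proposed implementation):
{code}
import org.apache.spark.SparkContext._ // pair-RDD operations such as join (Spark 1.x)
import org.apache.spark.rdd.RDD

// Hypothetical sketch: element-wise addition of two block matrices whose
// partitioning schemes match, keyed by (blockRow, blockCol). With a shared
// partitioner the join is co-partitioned, so no shuffle is needed; block
// multiplication would similarly join on the shared inner block index.
def add(a: RDD[((Int, Int), Array[Double])],
        b: RDD[((Int, Int), Array[Double])]): RDD[((Int, Int), Array[Double])] =
  a.join(b).mapValues { case (x, y) =>
    val out = new Array[Double](x.length)
    var i = 0
    while (i < x.length) { out(i) = x(i) + y(i); i += 1 }
    out
  }
{code}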
[jira] [Created] (SPARK-3976) Detect block matrix partitioning schemes
Reza Zadeh created SPARK-3976: - Summary: Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that non-identically partitioned matrices can be repartitioned before add/multiply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
Reza Zadeh created SPARK-3977: - Summary: Conversions between {Row, Coordinate}Matrix <-> BlockMatrix Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174146#comment-14174146 ] Apache Spark commented on SPARK-3971: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2830 Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
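To illustrate the failure mode being described (this is not the actual MLlib code): if converters are registered by an init() that only the driver runs, the registry in each executor JVM stays empty, and deserialization fails there. A hypothetical sketch:
{code}
// Illustration only: converters registered by an explicit init() call.
// The driver calls init(), but executor JVMs never do, so their copy of
// the map is empty and lookups fail at deserialization time.
object SerDeRegistry {
  private val converters =
    scala.collection.mutable.Map.empty[String, Array[Byte] => Any]

  def init(): Unit = {
    // Hypothetical registration; on a real cluster this runs driver-side only.
    converters("Vector") = bytes => bytes.length // stand-in for real decoding
  }

  def deserialize(name: String, bytes: Array[Byte]): Any =
    converters.getOrElse(name,
      sys.error(s"No converter for $name -- init() never ran here"))(bytes)
}
{code}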
[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174148#comment-14174148 ] Matt Cheah commented on SPARK-3466: --- I'll look into this. Someone please assign to me! Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
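A sketch of how the proposed setting could be used from application code, assuming the {{spark.driver.maxResultSize}} name and a "100m" size syntax as suggested above (neither is an existing Spark API at this point):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical usage of the proposed limit. With the proposal in place,
// a collect() whose results exceed the limit would abort the job instead
// of OOMing the driver.
val conf = new SparkConf()
  .setAppName("bounded-collect")
  .set("spark.driver.maxResultSize", "100m") // proposed setting, illustrative
val sc = new SparkContext(conf)
{code}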
[jira] [Created] (SPARK-3978) Schema change on Spark-Hive (Parquet file format) table not working
Nilesh Barge created SPARK-3978: --- Summary: Schema change on Spark-Hive (Parquet file format) table not working Key: SPARK-3978 URL: https://issues.apache.org/jira/browse/SPARK-3978 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Nilesh Barge On the following releases: Spark 1.1.0 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly), Apache HDFS 2.2. A Spark job is able to create/add/read data in Hive Parquet-formatted tables using HiveContext. But after changing the schema, the Spark job is no longer able to read existing data and throws the following exception:
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
code snippet in short:
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM temp_table_people1");
hiveContext.sql("SELECT * FROM people_table"); //Here, data read was successful.
hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)");
hiveContext.sql("SELECT * FROM people_table"); //Not able to read existing data and ArrayIndexOutOfBoundsException is thrown.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3466: -- Assignee: Matthew Cheah Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Assignee: Matthew Cheah Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3971: - Assignee: Davies Liu Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3971: - Target Version/s: 1.2.0 Affects Version/s: 1.2.0 Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
Marcelo Vanzin created SPARK-3979: - Summary: Yarn backend's default file replication should match HDFS's default one Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
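A sketch of the suggested fix, falling back to the filesystem's configured default rather than a hardcoded 3; it assumes a Hadoop Configuration in scope as hadoopConf, and the exact wiring into ClientBase is an assumption:
{code}
import org.apache.hadoop.fs.FileSystem

// Illustrative sketch only: prefer the HDFS default replication
// (dfs.replication) when the Spark setting is absent.
val fs = FileSystem.get(hadoopConf)
val replication = sparkConf.getInt("spark.yarn.submit.file.replication",
  fs.getDefaultReplication()).toShort
{code}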
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174165#comment-14174165 ] Marcelo Vanzin commented on SPARK-3979: --- BTW, this would avoid issues like this: {noformat}
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): file /user/systest/.sparkStaging/application_1413485082283_0001/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0.jar. Requested replication 3 exceeds maximum 1
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.verifyReplication(BlockManager.java:943)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplicationInt(FSNamesystem.java:2243)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplication(FSNamesystem.java:2233)
...
at org.apache.spark.deploy.yarn.ClientBase$class.copyFileToRemote(ClientBase.scala:101)
{noformat} Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-3979: -- Description: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). was: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {{noformat}} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {{noformat}} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174224#comment-14174224 ] Apache Spark commented on SPARK-3979: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2831 Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3980) GraphX Performance Issue
Jarred Li created SPARK-3980: Summary: GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3981) Consider a better approach to initialize SerDe on executors
Xiangrui Meng created SPARK-3981: Summary: Consider a better approach to initialize SerDe on executors Key: SPARK-3981 URL: https://issues.apache.org/jira/browse/SPARK-3981 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize MLlib types on executors as a hotfix. This is not ideal. We should find a way to add hooks to the SerDe in Core to support MLlib types in a pluggable way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
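One possible shape for such a hook, sketched here purely as an illustration of the idea (none of these names exist in Spark): core SerDe consults a registry of picklers that downstream modules like MLlib can extend.
{code}
// Hypothetical sketch of a pluggable SerDe hook in core: modules register
// picklers for their own types instead of core hardcoding (or copying) them.
trait TypePickler {
  def canHandle(obj: Any): Boolean
  def pickle(obj: Any): Array[Byte]
}

object PicklerRegistry {
  private val picklers = scala.collection.mutable.ArrayBuffer.empty[TypePickler]

  // Called once per JVM -- on the driver and on every executor -- so the
  // registration problem from SPARK-3971 does not recur.
  def register(p: TypePickler): Unit = synchronized { picklers += p }

  def pickle(obj: Any): Array[Byte] = synchronized {
    picklers.find(_.canHandle(obj)).map(_.pickle(obj))
      .getOrElse(sys.error(s"No pickler for ${obj.getClass}"))
  }
}
{code}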
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. (was: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed withon 7 hours.) GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3982) receiverStream in Python API
Davies Liu created SPARK-3982: - Summary: receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu receiverStream() is used to extend the input sources of streaming; it would be very useful to have it in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
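For reference, the existing Scala-side API that a Python receiverStream() would mirror looks roughly like this minimal custom receiver (details simplified; assumes an existing SparkContext `sc`):
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver that pushes one record per second.
class TickReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = new Thread("tick-receiver") {
    override def run(): Unit = {
      while (!isStopped()) {
        store("tick") // push one record into Spark
        Thread.sleep(1000)
      }
    }
  }.start()

  def onStop(): Unit = () // the thread above exits once isStopped() is true
}

val ssc = new StreamingContext(sc, Seconds(1))
val ticks = ssc.receiverStream(new TickReceiver)
{code}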
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. For a small dataset with 5,000,000 edges (http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip), the job can be completed within 16 seconds. was:I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. For a small dataset with 5,000,000 edges (http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip), the job can be completed within 16 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
Kay Ousterhout created SPARK-3983: - Summary: Scheduler delay (shown in the UI) is incorrect Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
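To illustrate the accounting problem described above (metric names here are descriptive, not the exact TaskMetrics fields): if delay is computed as all task time not spent running user code, the per-task overheads leak into it; subtracting the measured overheads attributes them correctly.
{code}
// Illustration only: the naive computation misattributes overheads
// to the scheduler.
def reportedSchedulerDelay(taskDuration: Long, executorRunTime: Long): Long =
  taskDuration - executorRunTime

// Subtracting deserialization and result-serialization time leaves only
// time actually spent waiting on scheduling.
def actualSchedulerDelay(taskDuration: Long, executorRunTime: Long,
                         taskDeserializationTime: Long,
                         resultSerializationTime: Long): Long =
  math.max(0L, taskDuration - executorRunTime -
    taskDeserializationTime - resultSerializationTime)
{code}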
[jira] [Issue Comment Deleted] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Comment: was deleted (was: This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration.) Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174452#comment-14174452 ] Kay Ousterhout commented on SPARK-3983: --- This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Description: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] was: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin edited comment on SPARK-3877 at 10/17/14 12:08 AM: -- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {{main()}} deals with errors, that still may not result in a non-zero exit status. was (Author: vanzin): [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts that run Spark jobs on YARN, because the failure cannot be detected in these scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin commented on SPARK-3877: --- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts that run Spark jobs on YARN, because the failure cannot be detected in these scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
Kay Ousterhout created SPARK-3984: - Summary: Display finer grained metrics about task launch overhead in the UI Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174574#comment-14174574 ] Kay Ousterhout commented on SPARK-3983: --- https://github.com/apache/spark/pull/2832 Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Component/s: Web UI Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3984: -- Comment: was deleted (was: https://github.com/apache/spark/pull/2832) Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174576#comment-14174576 ] Apache Spark commented on SPARK-3984: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174577#comment-14174577 ] Apache Spark commented on SPARK-3983: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174575#comment-14174575 ] Kay Ousterhout commented on SPARK-3984: --- https://github.com/apache/spark/pull/2832 Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3982) receiverStream in Python API
[ https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174580#comment-14174580 ] Apache Spark commented on SPARK-3982: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2833 receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Davies Liu receiverStream() is used to extend the input sources of streaming, it will be very useful to have it in Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3874) Provide stable TaskContext API
[ https://issues.apache.org/jira/browse/SPARK-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3874. Resolution: Fixed Fix Version/s: 1.2.0 Provide stable TaskContext API -- Key: SPARK-3874 URL: https://issues.apache.org/jira/browse/SPARK-3874 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.2.0 We made some improvements in SPARK-3543 but for Spark 1.2 we should convert TaskContext into a fully stable API. To do this I’d suggest the following changes - note that some of this reverses parts of SPARK-3543. The goal is to provide a class that users can’t easily construct and exposes only the public functionality.
1. Separate TaskContext into a public abstract class (TaskContext) and a private implementation called TaskContextImpl. The former should be a Java abstract class - the latter should be a private[spark] Scala class to reduce visibility (or maybe we can keep it as Java and tell people not to use it?).
2. The TaskContext abstract class will have (NOTE: this changes getXX() to XX() intentionally):
public isCompleted()
public isInterrupted()
public addTaskCompletionListener(...)
public addTaskCompletionCallback(...) (deprecated)
public stageId()
public partitionId()
public attemptId()
public isRunningLocally()
STATIC: public get(); set() and unset() at default visibility
3. A new private[spark] static object TaskContextHelper in the same package as TaskContext will exist to expose set() and unset() from within Spark using forwarder methods that just call TaskContext.set(). If someone within Spark wants to set this they call TaskContextHelper.set() and it forwards it.
4. TaskContextImpl will be used whenever we construct a TaskContext internally.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
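Rendered as a sketch, the proposed shape might look like the following (listener-registration methods omitted for brevity; everything here illustrates the proposal rather than the final code, and it assumes compilation inside package org.apache.spark so private[spark] applies):
{code}
package org.apache.spark

// Item 1/2: public abstract class exposing only the stable surface.
abstract class TaskContext {
  def isCompleted(): Boolean
  def isInterrupted(): Boolean
  def isRunningLocally(): Boolean
  def stageId(): Int
  def partitionId(): Int
  def attemptId(): Long
}

object TaskContext {
  private val taskContext = new ThreadLocal[TaskContext]
  def get(): TaskContext = taskContext.get()
  // Kept non-public, per item 2: reachable only from inside the package.
  private[spark] def set(tc: TaskContext): Unit = taskContext.set(tc)
  private[spark] def unset(): Unit = taskContext.remove()
}

// Items 1/4: the implementation stays out of the public API and is what
// Spark constructs internally.
private[spark] class TaskContextImpl(
    stage: Int, partition: Int, attempt: Long) extends TaskContext {
  @volatile private var completed = false
  private[spark] def markTaskCompleted(): Unit = { completed = true }
  def isCompleted(): Boolean = completed
  def isInterrupted(): Boolean = false
  def isRunningLocally(): Boolean = false
  def stageId(): Int = stage
  def partitionId(): Int = partition
  def attemptId(): Long = attempt
}

// Item 3: forwarders so internal callers can set/unset the thread-local
// without TaskContext exposing those methods publicly.
private[spark] object TaskContextHelper {
  def set(tc: TaskContext): Unit = TaskContext.set(tc)
  def unset(): Unit = TaskContext.unset()
}
{code}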
[jira] [Updated] (SPARK-3975) Block Matrix addition and multiplication
[ https://issues.apache.org/jira/browse/SPARK-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3975: --- Component/s: MLlib Block Matrix addition and multiplication Key: SPARK-3975 URL: https://issues.apache.org/jira/browse/SPARK-3975 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Block matrix addition and multiplication, for the case when partitioning schemes match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3974: -- Component/s: MLlib Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3976) Detect block matrix partitioning schemes
[ https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3976: -- Component/s: MLlib Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that non-identically partitioned matrices can be repartitioned before add/multiply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3977: -- Component/s: MLlib Conversions between {Row, Coordinate}Matrix <-> BlockMatrix --- Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174629#comment-14174629 ]

Patrick Wendell commented on SPARK-3882:

This is a known issue (SPARK-2316) that was fixed in Spark 1.1. To verify that you are hitting the same issue, would you mind testing your job with Spark 1.1 and seeing whether it still occurs?

JobProgressListener gets permanently out of sync with long running job
Key: SPARK-3882
URL: https://issues.apache.org/jira/browse/SPARK-3882
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png

A long running Spark context (non-streaming) will eventually start throwing the following in the driver:

{code}
java.util.NoSuchElementException: key not found: 12771
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
{code}
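The trace shows the listener dying on a bare HashMap.apply for a stage ID it no longer tracks. Independent of the actual fix in SPARK-2316, the general defensive pattern is Option-based map access, so a missing key degrades to a skipped event instead of an exception that kills the listener. A minimal sketch of that pattern; stageIdToData and StageUIData here are illustrative stand-ins, not the real JobProgressListener fields:

{code}
import scala.collection.mutable

// Illustrative stand-in for the listener's per-stage state.
case class StageUIData(var completedTasks: Int = 0)

val stageIdToData = mutable.HashMap[Int, StageUIData]()

def onStageCompleted(stageId: Int): Unit = {
  // HashMap.apply throws NoSuchElementException for a missing key;
  // get returns an Option and lets the listener skip unknown stages.
  stageIdToData.get(stageId) match {
    case Some(data) => data.completedTasks += 1
    case None => println(s"Ignoring completion for untracked stage $stageId")
  }
}
{code}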
[jira] [Updated] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-3973:
Component/s: Spark Core

Print callSite information for broadcast variables
Key: SPARK-3973
URL: https://issues.apache.org/jira/browse/SPARK-3973
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
Fix For: 1.2.0

Printing call site information for broadcast variables will help in debugging which variables are used and when they are used.
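SparkContext already records a call site for RDDs; doing the same at broadcast creation ties each broadcast ID back to a user source line. A minimal sketch of deriving such a string from the current stack trace, assuming a hypothetical callSite helper (this is not Spark's internal Utils.getCallSite):

{code}
// Hedged sketch: build a short "file:line" call-site string from the
// current stack. The frame-skipping heuristic is illustrative only.
def callSite(skipFrames: Int = 2): String = {
  val frames = Thread.currentThread.getStackTrace
  frames.drop(skipFrames).headOption
    .map(f => s"${f.getFileName}:${f.getLineNumber}")
    .getOrElse("(unknown)")
}

// At broadcast creation time this could be logged alongside the id, e.g.
//   logInfo(s"Created broadcast $id at ${callSite()}")
// so the driver log shows which user line produced each broadcast.
{code}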
[jira] [Closed] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3923.
Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Aaron Davidson
Target Version/s: 1.2.0

All Standalone Mode services time out with each other
Key: SPARK-3923
URL: https://issues.apache.org/jira/browse/SPARK-3923
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker
Fix For: 1.2.0

I'm seeing an issue where components in Standalone Mode (Worker, Master, Driver, and Executor) all time out with each other after around 1000 seconds. Here is an example log:

{code}
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356-
... precisely 1000 seconds later ...
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918
14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
14/10/13 07:00:36 INFO Master: akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-
14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system
{code}
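The "precisely 1000 seconds" gap lines up with the Spark 1.x default of spark.akka.heartbeat.interval (1000 seconds), which feeds Akka's failure detector. As a diagnostic sketch only (these keys are documented Spark 1.x settings, but the values below are illustrative and this is not the fix that landed in 1.2.0):

{code}
import org.apache.spark.SparkConf

// Diagnostic sketch: adjust the Akka heartbeat/failure-detector knobs
// documented for Spark 1.x and see whether the 1000 s disassociations move.
// Values are illustrative, not recommendations.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "100") // seconds; 1.x default is 1000
  .set("spark.akka.heartbeat.pauses", "600")   // tolerated heartbeat pause, seconds
  .set("spark.akka.timeout", "100")            // seconds for Akka ask timeouts
{code}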
[jira] [Closed] (SPARK-3941) _remainingMem should not increase twice when updateBlockInfo
[ https://issues.apache.org/jira/browse/SPARK-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3941.
Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Zhang, Liye
Target Version/s: 1.2.0

_remainingMem should not increase twice when updateBlockInfo
Key: SPARK-3941
URL: https://issues.apache.org/jira/browse/SPARK-3941
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye
Assignee: Zhang, Liye
Fix For: 1.2.0

In BlockManagerMasterActor, _remainingMem is increased by memSize twice during updateBlockInfo when the new storage level is invalid.
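The bug class here is double-crediting in memory bookkeeping: if removing a block's old entry credits its memory back, and the invalid-storage-level path credits it again, _remainingMem grows by memSize twice. A minimal sketch of the single-credit invariant, with illustrative names rather than the actual BlockManagerMasterActor code:

{code}
import scala.collection.mutable

// Illustrative bookkeeping, not the real BlockManagerMasterActor.
var remainingMem: Long = 8L * 1024 * 1024 * 1024 // hypothetical 8 GB budget
val blocks = mutable.HashMap[String, Long]()     // blockId -> memSize held

def updateBlockInfo(blockId: String, levelIsValid: Boolean, memSize: Long): Unit = {
  // Credit the memory of any existing entry exactly once.
  blocks.remove(blockId).foreach(old => remainingMem += old)
  if (levelIsValid) {
    blocks(blockId) = memSize
    remainingMem -= memSize
  }
  // Invalid level: the block is simply dropped. Crediting memSize again
  // here is exactly the double-increment this issue fixes.
}
{code}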
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-3890:
Affects Version/s: 1.1.0 (was: 1.2.0)

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Closed] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3890.
Resolution: Fixed
Fix Version/s: 1.2.0, 1.1.1
Assignee: WangTaoTheTonic
Target Version/s: 1.1.1, 1.2.0

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-3890:
Affects Version/s: 1.2.0

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ]

Patrick Wendell commented on SPARK-3963:

In the initial version of this, I don't want to do either of those things.

Support getting task-scoped properties from TaskContext
Key: SPARK-3963
URL: https://issues.apache.org/jira/browse/SPARK-3963
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Wendell

This is a proposal for a minor feature. Given the stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer that users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. I'd imagine us using this for other things later on; we could also use it to expose some of the taskMetrics, such as the input bytes.

{code}
val data = sc.textFile("s3n://..2014/*/*/*.json")
data.mapPartitions { iter =>
  val tc = TaskContext.get
  val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
  val parts = fileName.split("/")
  val (year, month, day) = (parts(3), parts(4), parts(5))
  ...
}
{code}

Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
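For the internal half of the proposal, a sketch of what the task-scoped property map could look like; TaskContextImpl, setProperty, and HADOOP_FILE_NAME are names taken from or implied by the proposal above, not a shipped Spark API:

{code}
// Hedged sketch of the proposed task-scoped property map. Names follow
// the proposal text above; none of this is a shipped Spark API.
class TaskContextImpl {
  private val props = new java.util.HashMap[String, String]()

  // Internal-only in the initial version (private[spark] in Spark proper):
  // an RDD, e.g. a Hadoop-backed one, would call this before handing the
  // partition iterator to user code.
  def setProperty(key: String, value: String): Unit = {
    props.put(key, value)
  }

  // The user-facing read side proposed above.
  def getProperty(key: String): String = props.get(key)
}
{code}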