[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173413#comment-14173413 ] Mridul Muralidharan commented on SPARK-3948: Not exactly; what I was suggesting was: a) At the beginning of the transferTo method, add
{code:java}
val initialPos = channel.position()
{code}
b) At the bottom of the transferTo method, before returning size, add
{code:java}
val finalPos = channel.position()
if (finalPos == initialPos) {
  logWarning("Hit kernel bug, upgrade kernel. Attempting workaround")
  channel.position(initialPos + size)
} else {
  assert(finalPos == initialPos + size)
}
{code}
From what I understand of the javadoc, this should alleviate the problem: of course, it will need verification on the setup you have where it is currently failing! Note that the reason I would prefer this to append is simple: the method is a generic method to copy streams - and it might be used (currently, or in the future) in scenarios where append is not true. So it would be good to be defensive about the final state.

Sort-based shuffle can lead to assorted stream-corruption exceptions
Key: SPARK-3948
URL: https://issues.apache.org/jira/browse/SPARK-3948
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Reporter: Saisai Shao
Assignee: Saisai Shao

Several exceptions occurred when running TPC-DS queries against the latest master branch with sort-based shuffle enabled, such as PARSING_ERROR(2) in Snappy, deserialization errors in Kryo and offset out-of-range in FileManagedBuffer; all these exceptions are gone when we change to hash-based shuffle. With deep investigation, we found that some shuffle output files are unexpectedly smaller than the others, as the log shows:
{noformat}
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_9_11, offset: 3055635, length: 236708, file length: 47274167
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_10_11, offset: 2986484, length: 222755, file length: 47174539
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_11_11, offset: 2995341, length: 259871, file length: 383405
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_12_11, offset: 2991030, length: 268191, file length: 47478892
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_13_11, offset: 3016292, length: 230694, file length: 47420826
14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_14_11, offset: 3061400, length: 241136, file length: 47395509
{noformat}
As you can see, the total file length of shuffle_6_11_11 is much smaller than the other map outputs of the same stage.
We also dumped the map outputs on the map side to see whether this small output is correct; below is the log:
{noformat}
In bypass merge sort, file name: /mnt/DP_disk1/animal/spark/spark-local-20141014182142-8345/22/shuffle_6_11_0.data, file length: 383405
length: 274722 262597 291290 272902 264941 270358 291005 295285 252482 287142 232617 259871 233734 241439 228897 234282 253834 235619 233803 255532 270739 253825 262087 266404 234273 250120 262983 257024 255947 254971 258908 247862 221613 258566 245399 251684 274843 226150 264278 245279 225656 235084 239466 212851 242245 218781 222191 215500 211548 234256 208601 204113 191923 217895 227020 215331 212313 223725 250876 256875 239276 266777 235520 237462 234063 242270 246825 255888 235937 236956 233099 264508 260303 233294 239061 254856 257475 230105 246553 260412 210355 211201 219572 206636 226866 209937 226618 218208 206255 248069 221717 222112 215734 248088 239207 246125 239056 241133 253091 246738 233128 242794 231606 255737 221123 252115 247286 229688 251087 250047 237579 263079 256251 238214 208641 201120 204009 200825 211965 200600 194492 226471 194887 226975 215072 206008 233288 222132 208860 219064 218162 237126 220465 201343 225711 232178 233786 212767 211462 213671 215853 227822 233782 214727 247001 228968 247413 222674 214241 184122 215643 207665 219079 215185 207718 212723 201613 216600 212591 208174 204195 208099 229079 230274 223373 214999 256626 228895 231821 383405 229646 220212 245495 245960 227556 213266 237203 203805 240509 239306 242365 218416 238487 219397 240026 251011 258369 255365 259811 283313 248450 264286 264562 257485 279459 249187 257609 274964 292369 273826
{noformat}
Here I dump the file name, the file length and each partition's length; obviously the sum of all partition lengths is not equal to the file length. So I think there may be a situation where partitionWriter in ExternalSorter does not always append to the end of the previously written file; the file's content is overwritten in some parts, and this leads to the exceptions I mentioned before. Also, when I changed the code of copyStream to disable transferTo and use the previous approach, all the issues were gone. So I think there may be some flushing problem in transferTo when the processed data is large.
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173427#comment-14173427 ] Saisai Shao commented on SPARK-3948: Hi [~mridulm80], thanks a lot for your suggestions, here is the snippet I changed:
{code}
val inChannel = in.asInstanceOf[FileInputStream].getChannel()
val outChannel = out.asInstanceOf[FileOutputStream].getChannel()
val initialPos = outChannel.position()
println("size = " + outChannel.size)
println("initial position = " + initialPos)
var count = 0L
val size = inChannel.size()
// In case the transferTo method transferred less data than we required.
while (count < size) {
  count += inChannel.transferTo(count, size - count, outChannel)
}
val finalPos = outChannel.position()
println("final position = " + finalPos)
if (initialPos == finalPos) {
  outChannel.position(initialPos + count)
} else {
  assert(finalPos == initialPos + count)
}
{code}
And the result shows as below:
{noformat}
size = 0 initial position = 0 final position = 0 size = 118 for i = 0
size = 118 initial position = 118 final position = 118 size = 118 for i = 1
size = 118 initial position = 236 final position = 236 size = 118 for i = 2
size = 118 initial position = 354 final position = 354 size = 118 for i = 3
size = 118 initial position = 472 final position = 472 size = 118 for i = 4
size = 118 initial position = 590 final position = 590 size = 118 for i = 5
size = 118 initial position = 708 final position = 708 size = 118 for i = 6
size = 118 initial position = 826 final position = 826 size = 118 for i = 7
size = 118 initial position = 944 final position = 944 size = 118 for i = 8
size = 118 initial position = 1062 final position = 1062 size = 118 for i = 9
{noformat}
There is still a problem on my 2.6.32 kernel machine: though the position is moving forward, the file size is still 118. But it is OK on my Ubuntu machine, so probably this is not feasible.
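For reference, the printed sequence above can be reproduced with a small driver that repeatedly copies one small file into the same output stream. The sketch below makes the setup explicit; the file names and the 118-byte input are assumptions, not part of the original test:
{code}
import java.io.{File, FileInputStream, FileOutputStream}

object TransferToRepro {
  def main(args: Array[String]): Unit = {
    val src = new File("src.bin") // hypothetical 118-byte input file
    val out = new FileOutputStream(new File("dst.bin")) // append = false, as in the report
    val outChannel = out.getChannel
    for (i <- 0 until 10) {
      val in = new FileInputStream(src)
      val inChannel = in.getChannel
      val initialPos = outChannel.position()
      val size = inChannel.size()
      var count = 0L
      // Loop in case transferTo transfers less than requested.
      while (count < size) {
        count += inChannel.transferTo(count, size - count, outChannel)
      }
      val finalPos = outChannel.position()
      if (initialPos == finalPos) {
        // Workaround under test: bump the position manually.
        outChannel.position(initialPos + count)
      }
      println(s"size = ${outChannel.size} for i = $i")
      in.close()
    }
    out.close()
  }
}
{code}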
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173428#comment-14173428 ] Davis Shepherd commented on SPARK-3882: --- This is also a serious memory leak that will cause long-running drivers for Spark Streaming jobs to exhaust their heap.

JobProgressListener gets permanently out of sync with long running job
Key: SPARK-3882
URL: https://issues.apache.org/jira/browse/SPARK-3882
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png

A long-running Spark context (non-streaming) will eventually start throwing the following in the driver:
{noformat}
java.util.NoSuchElementException: key not found: 12771
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
{noformat}
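The immediate failure above is scala.collection.mutable.HashMap.apply throwing on a stage ID that is no longer in the listener's map. Purely as an illustration of that failure mode (not the actual Spark fix), a guarded lookup avoids the exception:
{code}
import scala.collection.mutable

val stageIdToPool = mutable.HashMap[Int, String]() // hypothetical listener state

// stageIdToPool(12771) would throw java.util.NoSuchElementException here;
// a defensive lookup degrades gracefully instead of killing the listener bus.
val pool = stageIdToPool.getOrElse(12771, "default")
{code}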
[jira] [Commented] (SPARK-3807) SparkSql does not work for tables created using custom serde
[ https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173430#comment-14173430 ] Apache Spark commented on SPARK-3807: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2821

SparkSql does not work for tables created using custom serde
Key: SPARK-3807
URL: https://issues.apache.org/jira/browse/SPARK-3807
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: chirag aggarwal
Fix For: 1.1.1, 1.2.0

SparkSql crashes on selecting tables that use a custom serde. Example:
{noformat}
CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class')
STORED AS SEQUENCEFILE;
{noformat}
The following exception is seen on running a query like 'select * from table_name limit 1':
{noformat}
ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
{noformat}
After fixing this issue, when some columns in the table were referred to in the query, SparkSql could not resolve those references.
[jira] [Issue Comment Deleted] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-3882: -- Comment: was deleted (was: This is also a serious memory leak that will cause long-running drivers for Spark Streaming jobs to exhaust their heap.)
[jira] [Commented] (SPARK-3965) Spark assembly for hadoop2 contains avro-mapred for hadoop1
[ https://issues.apache.org/jira/browse/SPARK-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173445#comment-14173445 ] Apache Spark commented on SPARK-3965: - User 'dajac' has created a pull request for this issue: https://github.com/apache/spark/pull/2822

Spark assembly for hadoop2 contains avro-mapred for hadoop1
Key: SPARK-3965
URL: https://issues.apache.org/jira/browse/SPARK-3965
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 1.0.2, 1.1.0
Environment: hadoop2, HDP2.1
Reporter: David Jacot

When building the Spark assembly for hadoop2, org.apache.avro:avro-mapred for hadoop1 is picked up and added to the assembly, which leads to the following exception at runtime.
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
...
{code}
The patch for SPARK-3039 works well at compile time, but the artefact's classifier is not applied when the assembly is built. I'm not a Maven expert, but I don't think classifiers are applied to transitive dependencies.
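For comparison, sbt requires the classifier to be requested explicitly on the dependency itself; an illustrative build.sbt line (the avro-mapred version shown is an assumption) would be:
{code}
// Pulls the hadoop2 variant of avro-mapred instead of the default (hadoop1) one.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}
This makes the same pitfall visible: nothing in this mechanism propagates the classifier to dependencies pulled in transitively.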
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173451#comment-14173451 ] Reynold Xin commented on SPARK-3963: Would it make sense to support arbitrary data types also? Also, should we consider merging this with TaskMetrics?

Support getting task-scoped properties from TaskContext
Key: SPARK-3963
URL: https://issues.apache.org/jira/browse/SPARK-3963
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Wendell

This is a proposal for a minor feature. Given stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. But I'd imagine us using this for other things later on. We could also use this to expose some of the taskMetrics, such as e.g. the input bytes.
{code}
val data = sc.textFile("s3n//..2014/*/*/*.json")
data.mapPartitions { iter =>
  val tc = TaskContext.get
  val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
  val parts = fileName.split("/")
  val (year, month, day) = (parts(3), parts(4), parts(5))
  ...
}
{code}
Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
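A minimal sketch of the surface described above - a (String, String) map with a public getter and an internal setter - could look like the following; all names mirror the proposal and are not the actual Spark API:
{code}
package org.apache.spark

import scala.collection.mutable

class TaskContext {
  private val properties = mutable.HashMap[String, String]()

  // Internal: Spark (e.g. a Hadoop RDD) fills this in before the task body runs.
  private[spark] def setProperty(key: String, value: String): Unit =
    properties(key) = value

  // Public: read-only access from user code.
  def getProperty(key: String): String = properties.getOrElse(key, null)
}

object TaskContext {
  // Hypothetical standard key for the Hadoop input file name.
  val HADOOP_FILE_NAME = "spark.hadoopFileName"
}
{code}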
[jira] [Updated] (SPARK-3873) Scala style: check import ordering
[ https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3873: --- Assignee: (was: Marcelo Vanzin) Scala style: check import ordering -- Key: SPARK-3873 URL: https://issues.apache.org/jira/browse/SPARK-3873 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin
[jira] [Commented] (SPARK-3888) Limit the memory used by python worker
[ https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173454#comment-14173454 ] Apache Spark commented on SPARK-3888: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2743

Limit the memory used by python worker
Key: SPARK-3888
URL: https://issues.apache.org/jira/browse/SPARK-3888
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

Right now we do not limit the memory used by Python workers, so they may run out of memory and freeze the OS. It would be safe to have a configurable hard limit for them, which should be larger than spark.executor.python.memory.
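If such a hard limit lands behind a configuration key, setting it would presumably follow the usual SparkConf pattern; the key below is the one named in the description and is not a confirmed setting:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("python-worker-memory-demo")
  .set("spark.executor.python.memory", "512m") // assumed key, per the description above
val sc = new SparkContext(conf)
{code}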
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173489#comment-14173489 ] Mridul Muralidharan commented on SPARK-3948: Damn, this sucks: transferTo is not using the position of the channel to write output to ... while it is doing so when append is true (which effectively sets the position to the end of the file on the call to getChannel). The state of the channel, based on what we see above, is the same in both cases - since we can see the position is updated - and it is persisted and returned when we call getChannel in the next invocation of copyStreams. So there is some other set of issues at play which we might not be able to work around from the JVM. Given this, I think we should a) add a logError when initialPosition == finalPosition and inChannel.size > 0, asking users to upgrade to a newer Linux kernel, and b) of course use append = true to work around the immediate issues. (a) will ensure that developers and users/admins are notified in case other codepaths (currently or in the future) hit the same issue.
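A sketch of how (a) and (b) combine in a copyStreams-style helper, reusing the names from the snippets above; this is an illustration of the proposal, not the actual patch:
{code}
import java.io.{File, FileInputStream, FileOutputStream}

// Copy src into dst, opened with append = true (workaround (b)); still verify
// the channel position afterwards and complain loudly if the kernel bug bites (a).
def copyFileDefensively(src: File, dst: File): Long = {
  val in = new FileInputStream(src)
  val out = new FileOutputStream(dst, true)
  try {
    val inChannel = in.getChannel
    val outChannel = out.getChannel
    val initialPos = outChannel.position()
    val size = inChannel.size()
    var count = 0L
    while (count < size) {
      count += inChannel.transferTo(count, size - count, outChannel)
    }
    val finalPos = outChannel.position()
    if (finalPos == initialPos && size > 0) {
      // (a): position did not advance although bytes were written - a known
      // kernel bug; log an error asking users to upgrade, then reposition
      // manually as a best-effort workaround.
      System.err.println("transferTo did not advance the channel position; " +
        "please upgrade your Linux kernel")
      outChannel.position(initialPos + count)
    }
    count
  } finally {
    in.close()
    out.close()
  }
}
{code}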
[jira] [Comment Edited] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173489#comment-14173489 ] Mridul Muralidharan edited comment on SPARK-3948 at 10/16/14 7:39 AM: -- Damn, this sucks: transferTo is not using the position of the channel to write output to ... while it is doing so when append is true (which effectively sets the position to the end of the file on the call to getChannel). The state of the channel, based on what we see above, is the same in both cases - since we can see the position is updated - and it is persisted and returned when we call getChannel in the next invocation of copyStreams. Not sure if sync() will help ... So there is some other set of issues at play which we might not be able to work around from the JVM. Given this, I think we should a) add a logError when initialPosition == finalPosition and inChannel.size > 0, asking users to upgrade to a newer Linux kernel, and b) of course use append = true to work around the immediate issues. (a) will ensure that developers and users/admins are notified in case other codepaths (currently or in the future) hit the same issue.
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173526#comment-14173526 ] Ravindra Pesala commented on SPARK-3814: Added support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in this PR, and I updated the title of this defect.

Bitwise & does not work in Hive
Key: SPARK-3814
URL: https://issues.apache.org/jira/browse/SPARK-3814
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Assignee: Ravindra Pesala
Priority: Minor

Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2
{noformat}
TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = & TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2
{noformat}
SQLState: null ErrorCode: 0
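Once the operators are supported, they should behave like any other HiveQL expression; an illustrative use from Scala (table and column names are hypothetical, and an existing HiveContext is assumed):
{code}
// Exercises & (AND), | (OR), ^ (XOR) and ~ (NOT) on integer columns a and b.
val rows = hiveContext.sql("SELECT a & b, a | b, a ^ b, ~a FROM some_table")
rows.collect().foreach(println)
{code}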
[jira] [Updated] (SPARK-3814) Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-3814: --- Summary: Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL (was: Bitwise & does not work in Hive)
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173528#comment-14173528 ] Saisai Shao commented on SPARK-3948: Hi [~mridulm80], thanks a lot for your suggestion. I think the other code in Spark that currently uses copyStream will not be affected by this issue, since it only copies one input file to the output file. But some use cases like ExternalSorter will indeed be affected by it. I will submit a PR according to your suggestions, thanks a lot.
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173541#comment-14173541 ] qiaohaijun commented on SPARK-2706: --- sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Dhive.version=0.13.1 I get the same error output, and I ran git pull today to get the updated code.

Enable Spark to support Hive 0.13
Key: SPARK-2706
URL: https://issues.apache.org/jira/browse/SPARK-2706
Project: Spark
Issue Type: Dependency upgrade
Components: SQL
Affects Versions: 1.0.1
Reporter: Chunjun Xiao
Assignee: Zhan Zhang
Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff

It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error:
{quote}
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch;
 found : String
 required: Array[String]
[ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]], x$2: Class[_], x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc <and> ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR] val tableDesc = new TableDesc(
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition
[ERROR] val partPath = partition.getPartitionPath
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils
[ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name))
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR] new HiveDecimal(bd.underlying())
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch;
 found : org.apache.hadoop.fs.Path
 required: String
[ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context
[ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
[ERROR] ^
[ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying())
[ERROR] ^
[ERROR] 8 errors found
[DEBUG] Compilation failed (CompilerInterface)
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [2.579s]
[INFO] Spark Project Core SUCCESS [2:39.805s]
[INFO] Spark Project Bagel ... SUCCESS [21.148s]
[INFO] Spark Project GraphX .. SUCCESS [59.950s]
[INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
[INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
[INFO] Spark Project Tools ... SUCCESS [15.405s]
[INFO] Spark Project Catalyst SUCCESS [1:17.405s]
[INFO] Spark Project SQL . SUCCESS [1:11.094s]
[INFO] Spark Project Hive FAILURE [11.121s]
[INFO] Spark Project REPL
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173542#comment-14173542 ] Praveen Seluka commented on SPARK-3174: --- [~vanzin] Regarding your question related to the scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the remaining runtime of the stage using the following heuristic (sketched in code after this entry): var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)). We also add (activeTasks/2) as we need to take the currently running tasks into account. - I think the proposal I have made is not very different from the exponential approach. Let's say we start the Spark application with just 2 executors. It will double the number of executors and hence go to 4, 8 and so on. The difference I see here compared to the exponential approach is that we start doubling the current count of executors, whereas exponential starts from 1 (and also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as the exponential approach too. - In a way, this proposal adds an ETA-based heuristic in addition (a threshold for stage completion time). - Also, this proposal adds the memory-used heuristic for scaling decisions, which is missing in Andrew's PR (correct me if I am wrong here). This for sure will be very useful.

Provide elastic scaling within a Spark application
Key: SPARK-3174
URL: https://issues.apache.org/jira/browse/SPARK-3174
Project: Spark
Issue Type: Improvement
Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf

A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that
* Request more executors when many pending tasks are building up
* Discard executors when they are idle
See the latest design doc for more information.
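As a sketch, the stage-runtime estimate described in the comment above is just the following function; the names are hypothetical, not actual fields of the proposal's implementation:
{code}
// Estimated remaining runtime of a stage: average observed task time,
// scaled by the remaining tasks plus half of the currently running ones
// (running tasks are assumed to be, on average, half done).
def estimatedRuntimeForStage(averageTaskRuntimeMs: Double,
                             remainingTasks: Int,
                             activeTasks: Int): Double =
  averageTaskRuntimeMs * (remainingTasks + activeTasks / 2.0)
{code}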
[jira] [Commented] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed
[ https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173543#comment-14173543 ] qiaohaijun commented on SPARK-1479: --- sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Dhive.version=0.13.1 I get the same error output.

building spark on 2.0.0-cdh4.4.0 failed
Key: SPARK-1479
URL: https://issues.apache.org/jira/browse/SPARK-1479
Project: Spark
Issue Type: Question
Environment: 2.0.0-cdh4.4.0
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
spark 0.9.1
java version 1.6.0_32
Reporter: jackielihf
Attachments: mvn.log

{noformat}
[INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: Compilation failed
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101)
at sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70)
at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88)
at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:24)
at sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:22)
at sbt.inc.Incremental$.cycle(Incremental.scala:40)
at sbt.inc.Incremental$.compile(Incremental.scala:25)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:20)
at
{noformat}
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173542#comment-14173542 ] Praveen Seluka edited comment on SPARK-3174 at 10/16/14 8:55 AM: - [~vanzin] Regarding your question related to the scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the remaining runtime of the stage using the heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)). We also add (activeTasks/2) as we need to take the currently running tasks into account. - I think the proposal I have made is not very different from the exponential approach. Let's say we start the Spark application with just 2 executors. It will double the number of executors and hence go to 4, 8 and so on. The difference I see here compared to the exponential approach is that we start doubling the current count of executors, whereas exponential starts from 1 (and also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as the exponential approach too. - In a way, this proposal adds an ETA-based heuristic in addition (a threshold for stage completion time). - Also, this proposal adds the memory-used heuristic for scaling decisions, which is missing in Andrew's PR (correct me if I am wrong here). This for sure will be very useful. - The main point being, it does all this without making any changes in the TaskSchedulerImpl/TaskSetManager code base.
It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173542#comment-14173542 ] Praveen Seluka edited comment on SPARK-3174 at 10/16/14 8:56 AM: - [~vanzin] Regarding your question related to scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the runtime(remaining) of the stage using the following heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)) We also add (activeTasks/2) as we need to take the current running tasks into account. - I think, the proposal I have made is not very different from the exponential approach. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. The difference I see here comparing to exponential approach is, we start doubling the current count of executors whereas exponential starts from 1 (also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as exponential approach also. - In a way, this proposal adds some ETA based heuristic in addition. (threshold for stage completion time) - Also, this proposal adds the memory used heuristic too for scaling decisions which is missing in Andrew's PR. (Correct me if am wrong here). This for sure will be very useful. - The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager code base. And I think, thats a really nice thing. was (Author: praveenseluka): [~vanzin] Regarding your question related to scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the runtime(remaining) of the stage using the following heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)) We also add (activeTasks/2) as we need to take the current running tasks into account. - I think, the proposal I have made is not very different from the exponential approach. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. The difference I see here comparing to exponential approach is, we start doubling the current count of executors whereas exponential starts from 1 (also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as exponential approach also. - In a way, this proposal adds some ETA based heuristic in addition. (threshold for stage completion time) - Also, this proposal adds the memory used heuristic too for scaling decisions which is missing in Andrew's PR. (Correct me if am wrong here). This for sure will be very useful. - The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager code base. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. 
We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
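The heuristic above is simple enough to state as code. A minimal sketch, assuming the caller tracks the average task runtime and the remaining/active task counts per stage (names follow the comment; this is not the actual proposed patch):
{code}
// Sketch of the stage-ETA heuristic: running tasks are assumed to be
// roughly half finished, so each contributes half a task's runtime.
def estimatedRuntimeForStage(
    averageRuntime: Double,
    remainingTasks: Int,
    activeTasks: Int): Double = {
  averageRuntime * (remainingTasks + activeTasks / 2.0)
}

// Hypothetical usage: ask for more executors when the estimate exceeds a
// configured stage-completion-time threshold.
// if (estimatedRuntimeForStage(avg, remaining, active) > thresholdMs) scaleUp()
{code}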
[jira] [Created] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
Christophe PRÉAUD created SPARK-3967: Summary: Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-3967: - Attachment: spark-1.1.0-yarn_cluster_tmpdir.patch Ensure that the temporary file which the jar file is fetched into is located in the same directory as the target jar file Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173566#comment-14173566 ] Christophe PRÉAUD commented on SPARK-3967: -- After investigating, it turns out that the problem occurs when the executor fetches a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local/nm-local-dir (the first directory of yarn.nodemanager.local-dirs), and then moved to one of the directories of yarn.nodemanager.local-dirs:
-- if it is the same as the temporary file's directory (i.e. /d1/yarn/local/nm-local-dir), then the application continues normally
-- if it is another one (i.e. /d2/yarn/local/nm-local-dir, /d3/yarn/local/nm-local-dir, ...), it fails with the following error:
{noformat}
14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
at com.google.common.io.Files.copy(Files.java:436)
at com.google.common.io.Files.move(Files.java:651)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
I have no idea why the move fails when the source and target files are not on the same partition (it is no longer atomic, but it should succeed anyway); for the moment I have worked around the problem with the attached patch (i.e. I ensure that the temp file and the moved file are always on the same partition). Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Christophe PRÉAUD Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{noformat}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{noformat}
2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
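The attached patch's idea can be summarized in a short sketch: create the temporary download file in the same directory as the target file, so the final move never crosses a partition boundary. Here {{downloadTo}} is a hypothetical helper and the parameters are assumed inputs, not Spark's actual code:
{code}
import java.io.File
import com.google.common.io.Files

// Create the temp file next to the target so Files.move() degenerates to a
// same-partition rename instead of a copy across partitions.
def fetchToSamePartition(url: String, targetDir: File, filename: String): File = {
  val targetFile = new File(targetDir, filename)
  val tempFile = File.createTempFile("fetchFileTemp", null, targetDir)
  downloadTo(url, tempFile) // hypothetical helper that streams the jar into tempFile
  Files.move(tempFile, targetFile)
  targetFile
}
{code}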
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173618#comment-14173618 ] Sean Owen commented on SPARK-3431: -- Yes, that should be what scalatest does. It is a fork of an old surefire, so it only has very few options. This parallelization failed as above for a few reasons. I have not gotten surefire to run the Scala tests. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173619#comment-14173619 ] Dev Lakhani commented on SPARK-2321: There are some issues and bugs under the webui component that are active. Should we incorporate these into this Jira or is it best to work on them separately and then merge these (2321) changes later? https://issues.apache.org/jira/browse/SPARK/component/12322616 Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
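For context, a minimal sketch of how progress information is consumed through the current listener API, illustrating the job-to-stage bookkeeping that point 2 of the description complains about (event field names as in the current DeveloperApi; treat this as illustrative):
{code}
import org.apache.spark.scheduler._

class ProgressListener extends SparkListener {
  // The listener has to build its own job -> stages mapping, because the
  // events themselves do not connect jobs with stages.
  private val stagesByJob = collection.mutable.Map[Int, Seq[Int]]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    stagesByJob(jobStart.jobId) = jobStart.stageIds
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    // stageCompleted.stageInfo says a stage finished, but relating it back
    // to a job (or job group) relies entirely on the table built above.
  }
}
{code}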
[jira] [Created] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause
Yash Datta created SPARK-3968: - Summary: Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause Key: SPARK-3968 URL: https://issues.apache.org/jira/browse/SPARK-3968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.1.1 The parquet-mr project has introduced a new filter api, along with several fixes, like filtering on OPTIONAL columns as well. It can also eliminate entire RowGroups based on certain statistics like min/max. We can leverage that to further improve the performance of queries with filters. The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-3968: -- Shepherd: Yash Datta Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause --- Key: SPARK-3968 URL: https://issues.apache.org/jira/browse/SPARK-3968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.1.1 The parquet-mr project has introduced a new filter api, along with several fixes, like filtering on OPTIONAL columns as well. It can also eliminate entire RowGroups based on certain statistics like min/max. We can leverage that to further improve the performance of queries with filters. The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
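A rough sketch of what such a custom filter could look like on top of filter2's UserDefinedPredicate (class and method names follow parquet-mr's filter2 package; the eventual Spark patch may well differ):
{code}
import parquet.filter2.predicate.{Statistics, UserDefinedPredicate}

// Keep only records whose value is in the given set; evaluated inside the
// ParquetRecordReader so non-matching records never reach Spark.
class SetInFilter[T <: Comparable[T]](valueSet: Set[T])
  extends UserDefinedPredicate[T] with Serializable {

  override def keep(value: T): Boolean =
    value != null && valueSet.contains(value)

  // min/max statistics alone cannot prove that a row group contains no
  // value from an arbitrary set, so never drop chunks on that basis.
  override def canDrop(statistics: Statistics[T]): Boolean = false

  override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
}
{code}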
[jira] [Commented] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173731#comment-14173731 ] ssj commented on SPARK-3629: Need someone to verify this patch. Improvements to YARN doc Key: SPARK-3629 URL: https://issues.apache.org/jira/browse/SPARK-3629 Project: Spark Issue Type: Documentation Components: Documentation, YARN Reporter: Matei Zaharia Labels: starter Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. In addition, the doc mentions yarn-cluster vs yarn-client as separate masters, which is inconsistent with the help output from spark-submit (which says to always use yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173757#comment-14173757 ] Apache Spark commented on SPARK-3948: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2824 Sort-based shuffle can lead to assorted stream-corruption exceptions Key: SPARK-3948 URL: https://issues.apache.org/jira/browse/SPARK-3948 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Saisai Shao Assignee: Saisai Shao Several exceptions occurred when running TPC-DS queries against latest master branch with sort-based shuffle enable, like PARSING_ERROR(2) in snappy, deserializing error in Kryo and offset out-range in FileManagedBuffer, all these exceptions are gone when we changed to hash-based shuffle. With deep investigation, we found that some shuffle output file is unexpectedly smaller than the others, as the log shows: {noformat} 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_9_11, offset: 3055635, length: 236708, file length: 47274167 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_10_11, offset: 2986484, length: 222755, file length: 47174539 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_11_11, offset: 2995341, length: 259871, file length: 383405 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_12_11, offset: 2991030, length: 268191, file length: 47478892 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_13_11, offset: 3016292, length: 230694, file length: 47420826 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: shuffle_6_14_11, offset: 3061400, length: 241136, file length: 47395509 {noformat} As you can see the total file length of shuffle_6_11_11 is much smaller than other same stage map output results. 
And we also dumped the map outputs on the map side to see whether this small output is correct or not; below is the log: {noformat} In bypass merge sort, file name: /mnt/DP_disk1/animal/spark/spark-local-20141014182142-8345/22/shuffle_6_11_0.data, file length: 383405
length: 274722 262597 291290 272902 264941 270358 291005 295285 252482 287142 232617 259871 233734 241439 228897 234282 253834 235619 233803 255532 270739 253825 262087 266404 234273 250120 262983 257024 255947 254971 258908 247862 221613 258566 245399 251684 274843 226150 264278 245279 225656 235084 239466 212851 242245 218781 222191 215500 211548 234256 208601 204113 191923 217895 227020 215331 212313 223725 250876 256875 239276 266777 235520 237462 234063 242270 246825 255888 235937 236956 233099 264508 260303 233294 239061 254856 257475 230105 246553 260412 210355 211201 219572 206636 226866 209937 226618 218208 206255 248069 221717 222112 215734 248088 239207 246125 239056 241133 253091 246738 233128 242794 231606 255737 221123 252115 247286 229688 251087 250047 237579 263079 256251 238214 208641 201120 204009 200825 211965 200600 194492 226471 194887 226975 215072 206008 233288 222132 208860 219064 218162 237126 220465 201343 225711 232178 233786 212767 211462 213671 215853 227822 233782 214727 247001 228968 247413 222674 214241 184122 215643 207665 219079 215185 207718 212723 201613 216600 212591 208174 204195 208099 229079 230274 223373 214999 256626 228895 231821 383405 229646 220212 245495 245960 227556 213266 237203 203805 240509 239306 242365 218416 238487 219397 240026 251011 258369 255365 259811 283313 248450 264286 264562 257485 279459 249187 257609 274964 292369 273826 {noformat} Here I dump the file name, length and each partition's length; obviously the sum of all partition lengths is not equal to the file length. So I think there may be a situation where partitionWriter in ExternalSorter does not always append to the end of the previously written file: the file's content is overwritten in some parts, and this leads to the exceptions I mentioned before. Also, I changed the code of copyStream to disable transferTo and use the previous implementation, and all the issues are gone. So I think there may be some flushing problem in transferTo when the processed data is large. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
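The non-transferTo fallback mentioned above is just a plain buffered copy. A minimal sketch of that shape (illustrative, not the exact previous implementation of copyStream):
{code}
import java.io.{InputStream, OutputStream}

// Copy everything from in to out through a heap buffer, avoiding
// FileChannel.transferTo and the flushing behavior suspected above.
def copyStreamFallback(in: InputStream, out: OutputStream): Long = {
  val buf = new Array[Byte](8192)
  var count = 0L
  var n = in.read(buf)
  while (n != -1) {
    out.write(buf, 0, n)
    count += n
    n = in.read(buf)
  }
  count // total bytes copied
}
{code}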
[jira] [Commented] (SPARK-3907) add truncate table support
[ https://issues.apache.org/jira/browse/SPARK-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173806#comment-14173806 ] Apache Spark commented on SPARK-3907: - User 'wangxiaojing' has created a pull request for this issue: https://github.com/apache/spark/pull/2770 add truncate table support - Key: SPARK-3907 URL: https://issues.apache.org/jira/browse/SPARK-3907 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: features Fix For: 1.1.0 Original Estimate: 0.05h Remaining Estimate: 0.05h The truncate table syntax had been disabled. It removes all rows from a table or partition(s). Currently the target table should be a native/managed table or an exception will be thrown. The user can specify a partial partition_spec to truncate multiple partitions at once, and omitting partition_spec will truncate all partitions in the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
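Since the syntax mirrors Hive's TRUNCATE TABLE, usage from Spark SQL's Hive support would presumably look like the following sketch (table and partition values are made up; assumes a HiveContext named hiveContext):
{code}
// Remove all rows from every partition of a managed table.
hiveContext.sql("TRUNCATE TABLE page_views")

// Remove rows only from the partitions matching a partial partition_spec.
hiveContext.sql("TRUNCATE TABLE page_views PARTITION (ds='2014-10-16')")
{code}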
[jira] [Commented] (SPARK-3969) Optimizer should have a super class as an interface.
[ https://issues.apache.org/jira/browse/SPARK-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173854#comment-14173854 ] Apache Spark commented on SPARK-3969: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2825 Optimizer should have a super class as an interface. Key: SPARK-3969 URL: https://issues.apache.org/jira/browse/SPARK-3969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Some developers want to replace {{Optimizer}} to fit their projects but can't do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3970) Remove duplicate removal of local dirs
Liang-Chi Hsieh created SPARK-3970: -- Summary: Remove duplicate removal of local dirs Key: SPARK-3970 URL: https://issues.apache.org/jira/browse/SPARK-3970 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh The shutdown hook of DiskBlockManager already removes localDirs, so there is no need to also register them with Utils.registerShutdownDeleteDir. Registering them causes duplicate removal of these local dirs and corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
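A small sketch of the redundancy being removed (Utils.registerShutdownDeleteDir is Spark's; the surrounding code is illustrative):
{code}
// DiskBlockManager already installs its own shutdown hook that deletes
// localDirs, so registering them again makes shutdown try to delete the
// same directories twice, producing the reported exceptions.
localDirs.foreach { dir =>
  Utils.registerShutdownDeleteDir(dir) // redundant registration
}
{code}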
[jira] [Created] (SPARK-3969) Optimizer should have a super class as an interface.
Takuya Ueshin created SPARK-3969: Summary: Optimizer should have a super class as an interface. Key: SPARK-3969 URL: https://issues.apache.org/jira/browse/SPARK-3969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Some developers want to replace {{Optimizer}} to fit their projects but can't do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173856#comment-14173856 ] Apache Spark commented on SPARK-3970: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/2826 Remove duplicate removal of local dirs -- Key: SPARK-3970 URL: https://issues.apache.org/jira/browse/SPARK-3970 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh The shutdown hook of DiskBlockManager already removes localDirs, so there is no need to also register them with Utils.registerShutdownDeleteDir. Registering them causes duplicate removal of these local dirs and corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173904#comment-14173904 ] Akshat Aranya commented on SPARK-2365: -- This looks great! I have been using IndexedRDD for a while, to great effect. I have one suggestion: it would be nice to override setName() in IndexedRDDLike
{code}
override def setName(_name: String): this.type = {
  partitionsRDD.setName(_name)
  this
}
{code}
so that the IndexedRDD shows up with friendly names in the storage UI, just like regular, cached RDDs do. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
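With the override in place, usage would presumably be the same as for any other RDD; a one-line sketch, assuming an IndexedRDD value named vertices:
{code}
// The friendly name now propagates to partitionsRDD and shows up in the
// storage tab, just like for a regular cached RDD.
vertices.setName("vertex-index").cache()
{code}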
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173926#comment-14173926 ] Dev Lakhani commented on SPARK-3957: Hi all, here are my thoughts on a possible approach. The broadcast occurs from the SparkContext to the BroadcastManager and its newBroadcast method. In the first instance, the broadcasted data is stored in the Block Manager (see HttpBroadcast) of the executor. Any tracking of broadcast variables must be referenced by the BlockManagerSlaveActor and BlockManagerMasterActor. In particular, UpdateBlockInfo and RemoveBroadcast should update the total memory used in blocks when blocks are added and removed. These can then be hooked up to the UI using a new page like ExecutorsPage and by defining new methods in the relevant listener, such as StorageStatusListener. These are my initial thoughts as someone new to these components; any other ideas or approaches? Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173951#comment-14173951 ] Marcelo Vanzin commented on SPARK-3174: --- bq. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. Well, I'd say it's unusual for applications to start with a low number of executors, especially if the user knows it will be executing things right away. So if I start it with 32 executors, your code will right away try to make it 64. Andrew's approach would try to make it 33, then 35, then... But I agree that it might be a good idea to make the auto-scaling backend an interface, so that we can easily play with different approaches. That shouldn't be hard at all. bq. The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager Theoretically, I agree that's a good thing. I haven't gone through the code in detail, though, to know whether all the information Andrew is using from the scheduler is available from SparkListener events. If you can derive that info, great, I think it would be worth it to make the auto-scale code decoupled from the scheduler. If not, then we either have the choice of hooking the auto-scaling backend into the scheduler (like Andrew's change) or exposing more info in the events - which may or may not be a good thing, depending on what that info is. Anyway, as I've said, both approaches are not irreconcilably different - they're actually more similar than not. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
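To make the contrast concrete, a sketch of the two growth policies being compared (illustrative only, using the executor counts from the comment above):
{code}
// Doubling from the current count: 32 -> 64 -> 128 -> ...
def doubling(current: Int): Int = current * 2

// Exponential increments starting from 1 executor per round:
// 32 -> 33 -> 35 -> 39 -> ... (round = consecutive rounds with a backlog)
def exponentialAdd(current: Int, round: Int): Int = current + (1 << round)
{code}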
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173954#comment-14173954 ] Shivaram Venkataraman commented on SPARK-3957: -- I think it needs to be tracked in the Block Manager -- However we also need to track this on a per-executor basis and not just at the driver. Right now AFAIK, executors do not report new broadcast blocks to the master to reduce communication. However we could add broadcast blocks to some periodic report. [~andrewor] might know more. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174005#comment-14174005 ] Marcelo Vanzin commented on SPARK-3883: --- FYI, any PR here should make sure the default configuration is safe against the POODLE attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack Provide SSL support for Akka and HttpServer based connections - Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jacek Lewandowski Spark uses at least 4 logical communication channels: 1. Control messages - Akka based 2. JARs and other files - Jetty based (HttpServer) 3. Computation results - Java NIO based 4. Web UI - Jetty based The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
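On the Jetty side, the linked workaround boils down to excluding SSLv3 on the SslContextFactory. A sketch (method names per Jetty's SslContextFactory; exact availability depends on the Jetty version Spark bundles, so treat this as an assumption):
{code}
import org.eclipse.jetty.util.ssl.SslContextFactory

val sslFactory = new SslContextFactory(keystorePath) // keystorePath: assumed input
sslFactory.setKeyStorePassword(keystorePassword)     // keystorePassword: assumed input
// Never negotiate SSLv3, which is what POODLE exploits.
sslFactory.addExcludeProtocols("SSLv3")
{code}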
[jira] [Commented] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174003#comment-14174003 ] Marcelo Vanzin commented on SPARK-2750: --- FYI, any PR here should make sure the default configuration is safe against the POODLE attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Fix For: 1.0.3 Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for the web ui using Jetty ssl integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. The user can switch between https and http by configuring spark.http.policy in a JVM property for each process, with http chosen by default. 2. The web port of Master and Worker is decided in order of launch arguments, JVM property, system env and default port. 3. Below are some other configuration items:
spark.ssl.server.keystore.location - The file or URL of the SSL key store
spark.ssl.server.keystore.password - The password for the key store
spark.ssl.server.keystore.keypassword - The password (if any) for the specific key within the key store
spark.ssl.server.keystore.type - The type of the key store (default JKS)
spark.client.https.need-auth - True if SSL needs client authentication
spark.ssl.server.truststore.location - The file name or URL of the trust store location
spark.ssl.server.truststore.password - The password for the trust store
spark.ssl.server.truststore.type - The type of the trust store (default JKS)
Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174019#comment-14174019 ] Andrew Or commented on SPARK-3957: -- Yeah my understanding is that broadcast blocks aren't reported to the driver (and it makes sense to not report them because the driver is the one who initiated the broadcast in the first place). The source of the broadcast info we want to display is in the BlockManager of each executor, and we need to get this to the driver somehow. We could add some periodic reporting but that opens another channel between the driver and the executors. There is an ongoing effort to do something similar for task metrics https://github.com/apache/spark/pull/2087, so maybe we can piggyback this information on the heartbeats there. Also I believe this is a duplicate of an old issue SPARK-1761, though this one contains more information so let's keep this one open. I will close the other one in favor of this. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
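A rough sketch of the piggybacking idea: attach a per-executor broadcast memory total to the existing heartbeat message (the extra field is hypothetical, not Spark's current Heartbeat):
{code}
import org.apache.spark.executor.TaskMetrics
import org.apache.spark.storage.BlockManagerId

// Today's heartbeat carries roughly (executorId, taskMetrics, blockManagerId);
// the last field below is the hypothetical addition, summed on the executor
// from its BlockManager's broadcast blocks.
case class Heartbeat(
    executorId: String,
    taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId,
    broadcastBytesUsed: Long)
{code}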
[jira] [Closed] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1761. Resolution: Duplicate Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174024#comment-14174024 ] Andrew Or commented on SPARK-3957: -- Hey [~devl.development] are you planning to work on this? Or is [~CodingCat]? The latter is currently assigned but maybe you guys should work it out. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174026#comment-14174026 ] Andrew Or commented on SPARK-1761: -- Closing in favor of SPARK-3957, which is more descriptive. Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3971) Failed to deserialize Vector in cluster mode
Davies Liu created SPARK-3971: - Summary: Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called in the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2585) Remove special handling of Hadoop JobConf
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2585. --- Resolution: Fixed Due to the CONFIGURATION_INSTANTIATION_LOCK thread-safety issue, I think that we'll still end up having to serialize the Configuration separately. If we didn't, then we'd have to hold CONFIGURATION_INSTANTIATION_LOCK while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long). Therefore, I'm closing this as Won't Fix. The thread-safety issues with Configuration will be addressed by a separate clone() patch. Remove special handling of Hadoop JobConf - Key: SPARK-2585 URL: https://issues.apache.org/jira/browse/SPARK-2585 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Josh Rosen Priority: Critical This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the implementation does not use shared conf objects). We no longer need to specially broadcast the Hadoop configuration since we are broadcasting RDD data anyways. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
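The pattern under discussion can be sketched as follows: keep the Configuration out of the task closure, serialize it separately, and take the global lock only around deserialization (names are illustrative; the lock object stands in for Spark's CONFIGURATION_INSTANTIATION_LOCK):
{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // Configuration is a Hadoop Writable
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // Instantiating and loading a Configuration is not thread-safe, so
    // guard deserialization with a global lock.
    SerializableConfiguration.lock.synchronized {
      value = new Configuration(false)
      value.readFields(in)
    }
  }
}

object SerializableConfiguration {
  val lock = new Object()
}
{code}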
[jira] [Created] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
Michael Griffiths created SPARK-3972: Summary: PySpark Error on Windows with sc.wholeTextFiles Key: SPARK-3972 URL: https://issues.apache.org/jira/browse/SPARK-3972 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, Windows Affects Versions: 1.1.0 Environment: Windows 8.1 x64 Java SE Version 8 Update 20 (build 1.8.0_20-b26); Python 2.7.7 Reporter: Michael Griffiths Priority: Minor When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even though I can read the text file(s) individually with sc.textFile(). Steps followed: 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz]) 2) Extract into folder at root of drive: **D:\spark** 3) Create test folder at **D:\testdata** with one (HTML) file contained within it. 4) Launch PySpark at **bin\PySpark** 5) Try to use sc.wholeTextFiles('d:/testdata'); fail. Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine. See session (below) with tracebacks from errors.
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884")
>>> file.count()
732
>>> file.first()
u'<!DOCTYPE html>'
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
    return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174049#comment-14174049 ] Apache Spark commented on SPARK-3736: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/2828 Workers should reconnect to Master if disconnected -- Key: SPARK-3736 URL: https://issues.apache.org/jira/browse/SPARK-3736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Ash Assignee: Matthew Cheah Priority: Critical In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master. The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec). This has been observed by: - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174053#comment-14174053 ] Nan Zhu commented on SPARK-3957: I agree with [~andrewor14]; I was also thinking about piggybacking the information on the heartbeat between the heartbeatReceiver and the executor... Not sure about the current Hadoop implementation, but in the 1.x version, TaskStatus was piggybacked on the heartbeat between the TaskTracker and the JobTracker; to me, it's a very natural way to do this. I accepted it this morning and have started some work, so, [~devlakhani], please let me finish this, thanks. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables are not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows list of rdds cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3973) Print callSite information for broadcast variables
Shivaram Venkataraman created SPARK-3973: Summary: Print callSite information for broadcast variables Key: SPARK-3973 URL: https://issues.apache.org/jira/browse/SPARK-3973 Project: Spark Issue Type: Bug Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Minor Fix For: 1.2.0 Printing call site information for broadcast variables will help in debugging which variables are used, when they are used etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174066#comment-14174066 ] Apache Spark commented on SPARK-3973: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/2829 Print callSite information for broadcast variables -- Key: SPARK-3973 URL: https://issues.apache.org/jira/browse/SPARK-3973 Project: Spark Issue Type: Bug Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Minor Fix For: 1.2.0 Printing call site information for broadcast variables will help in debugging which variables are used, when they are used etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174073#comment-14174073 ] Zhan Zhang commented on SPARK-2706: --- The code does not go to upstream yet. To build 0.13.1 support, you need to apply the patch. But now it uses the customized package with org.spark-project, which seems to be withdraw from published. So to use it, you need to change org.spark-project to original hive package. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Assignee: Zhan Zhang Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal = new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent 
POM .. SUCCESS [2.579s]
[INFO] Spark Project Core SUCCESS [2:39.805s]
[INFO] Spark Project Bagel ... SUCCESS [21.148s]
[INFO] Spark Project GraphX .. SUCCESS [59.950s]
[INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
[INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
[INFO] Spark Project Tools ... SUCCESS [15.405s]
[INFO] Spark Project Catalyst SUCCESS [1:17.405s]
[INFO] Spark Project SQL . SUCCESS [1:11.094s]
[INFO] Spark Project Hive
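To make the package swap Zhan Zhang describes concrete, here is a minimal sketch of what it might look like in an sbt build; the group/artifact IDs and versions are illustrative assumptions, not the actual patch:
{code}
// Hypothetical build.sbt fragment: drop the customized org.spark-project
// Hive artifact and depend on the upstream Apache Hive package instead.
// Coordinates and versions here are assumptions for illustration only.
libraryDependencies -= "org.spark-project.hive" % "hive-exec" % "0.12.0"
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.13.1"
{code}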
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174107#comment-14174107 ] Dev Lakhani commented on SPARK-3957: Hi, for now I am happy for [~CodingCat] to take this on; maybe once there are some commits I can help with the UI side, but until then I'll hold back. Broadcast variable memory usage not reflected in UI --- Key: SPARK-3957 URL: https://issues.apache.org/jira/browse/SPARK-3957 Project: Spark Issue Type: Bug Components: Block Manager, Web UI Affects Versions: 1.0.2, 1.1.0 Reporter: Shivaram Venkataraman Assignee: Nan Zhu Memory used by broadcast variables is not reflected in the memory usage reported in the WebUI. For example, the Executors tab shows memory used in each executor, but this number doesn't include memory used by broadcast variables. Similarly, the Storage tab only shows the list of cached RDDs and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174110#comment-14174110 ] Michael Griffiths commented on SPARK-3972: -- This issue does NOT occur if I build Spark from source using Bash and sbt\sbt assembly; it's restricted to the pre-compiled version. PySpark Error on Windows with sc.wholeTextFiles --- Key: SPARK-3972 URL: https://issues.apache.org/jira/browse/SPARK-3972 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, Windows Affects Versions: 1.1.0 Environment: Windows 8.1 x64 Java SE Version 8 Update 20 (build 1.8.0_20-b26); Python 2.7.7 Reporter: Michael Griffiths Priority: Minor When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even though I can read the text file(s) individually with sc.textFile(). Steps followed: 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz]) 2) Extract into folder at root of drive: **D:\spark** 3) Create test folder at **D:\testdata** with one (HTML) file contained within it. 4) Launch PySpark at **bin\PySpark** 5) Try to use sc.wholeTextFiles('d:/testdata'); it fails. Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine. See the session (below) with tracebacks from errors. {noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile('d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884')
>>> file.count()
732
>>> file.first()
u'<!DOCTYPE html>'
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
    return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at
[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3882: --- Description: A long running spark context (non-streaming) will eventually start throwing the following in the driver: {code}
java.util.NoSuchElementException: key not found: 12771
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
{code} And the UI will show running jobs that are in fact no longer running and never clean them up (see attached screenshot). The result is that the UI becomes unusable, and the JobProgressListener leaks memory as the list of running jobs continues to grow. was: A long running spark context (non-streaming) will eventually start throwing the following in the driver:
[jira] [Created] (SPARK-3974) Block matrix abstractions and partitioners
Reza Zadeh created SPARK-3974: - Summary: Block matrix abstractions and partitioners Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
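As a rough illustration of what such an abstraction could look like, here is a sketch assuming fixed square blocks stored in an RDD keyed by block coordinate; all names (DenseBlock, BlockMatrix, GridPartitioner) are hypothetical, not the eventual MLlib API:
{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical sketch only. A dense block addressed by its
// (blockRow, blockCol) coordinate; blockSize is fixed for the whole matrix.
case class DenseBlock(values: Array[Double])

class BlockMatrix(val blocks: RDD[((Int, Int), DenseBlock)],
                  val numBlockRows: Int,
                  val numBlockCols: Int,
                  val blockSize: Int)

// A row-major grid partitioner over block coordinates, so that blocks
// combined together tend to land in the same partition.
class GridPartitioner(numBlockRows: Int, numBlockCols: Int,
                      override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (blockRow: Int, blockCol: Int) =>
      // Linearize the grid coordinate, then spread it over the partitions.
      math.abs((blockRow * numBlockCols + blockCol) % numPartitions)
    case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
  }
}
{code}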
[jira] [Created] (SPARK-3975) Block Matrix addition and multiplication
Reza Zadeh created SPARK-3975: - Summary: Block Matrix addition and multiplication Key: SPARK-3975 URL: https://issues.apache.org/jira/browse/SPARK-3975 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Block matrix addition and multiplication, for the case when partitioning schemes match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
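One way the matching-partitioner case could be handled, continuing the hypothetical block representation sketched under SPARK-3974 above (an illustration only, not the proposed implementation):
{code}
import org.apache.spark.SparkContext._ // pair-RDD operations such as join (Spark 1.x)
import org.apache.spark.rdd.RDD

// Hypothetical sketch: element-wise addition of two block matrices whose
// partitioning schemes match, keyed by (blockRow, blockCol). With a shared
// partitioner the join is co-partitioned, so no shuffle is needed; block
// multiplication would similarly join on the shared inner block index.
def add(a: RDD[((Int, Int), Array[Double])],
        b: RDD[((Int, Int), Array[Double])]): RDD[((Int, Int), Array[Double])] =
  a.join(b).mapValues { case (x, y) =>
    val out = new Array[Double](x.length)
    var i = 0
    while (i < x.length) { out(i) = x(i) + y(i); i += 1 }
    out
  }
{code}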
[jira] [Created] (SPARK-3976) Detect block matrix partitioning schemes
Reza Zadeh created SPARK-3976: - Summary: Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that non-identically partitioned matrices can be repartitioned before add/multiply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
Reza Zadeh created SPARK-3977: - Summary: Conversions between {Row, Coordinate}Matrix <-> BlockMatrix Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174146#comment-14174146 ] Apache Spark commented on SPARK-3971: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2830 Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
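To illustrate the failure mode being described (this is not the actual MLlib code): if converters are registered by an init() that only the driver runs, the registry in each executor JVM stays empty, and deserialization fails there. A hypothetical sketch:
{code}
// Illustration only: converters registered by an explicit init() call.
// The driver calls init(), but executor JVMs never do, so their copy of
// the map is empty and lookups fail at deserialization time.
object SerDeRegistry {
  private val converters =
    scala.collection.mutable.Map.empty[String, Array[Byte] => Any]

  def init(): Unit = {
    // Hypothetical registration; on a real cluster this runs driver-side only.
    converters("Vector") = bytes => bytes.length // stand-in for real decoding
  }

  def deserialize(name: String, bytes: Array[Byte]): Any =
    converters.getOrElse(name,
      sys.error(s"No converter for $name -- init() never ran here"))(bytes)
}
{code}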
[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174148#comment-14174148 ] Matt Cheah commented on SPARK-3466: --- I'll look into this. Someone please assign to me! Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
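A sketch of how the proposed setting could be used from application code, assuming the {{spark.driver.maxResultSize}} name and a "100m" size syntax as suggested above (neither is an existing Spark API at this point):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical usage of the proposed limit. With the proposal in place,
// a collect() whose results exceed the limit would abort the job instead
// of OOMing the driver.
val conf = new SparkConf()
  .setAppName("bounded-collect")
  .set("spark.driver.maxResultSize", "100m") // proposed setting, illustrative
val sc = new SparkContext(conf)
{code}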
[jira] [Created] (SPARK-3978) Schema change on Spark-Hive (Parquet file format) table not working
Nilesh Barge created SPARK-3978: --- Summary: Schema change on Spark-Hive (Parquet file format) table not working Key: SPARK-3978 URL: https://issues.apache.org/jira/browse/SPARK-3978 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Nilesh Barge On the following releases: Spark 1.1.0 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly), Apache HDFS 2.2. A Spark job is able to create/add/read data in Hive Parquet-formatted tables using HiveContext. But after changing the schema, the Spark job is no longer able to read existing data and throws the following exception:
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
code snippet in short:
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM temp_table_people1");
hiveContext.sql("SELECT * FROM people_table"); //Here, data read was successful.
hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)");
hiveContext.sql("SELECT * FROM people_table"); //Not able to read existing data and ArrayIndexOutOfBoundsException is thrown.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3466: -- Assignee: Matthew Cheah Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Assignee: Matthew Cheah Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3971: - Assignee: Davies Liu Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3971: - Target Version/s: 1.2.0 Affects Version/s: 1.2.0 Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called on the executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
Marcelo Vanzin created SPARK-3979: - Summary: Yarn backend's default file replication should match HDFS's default one Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
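A sketch of the suggested fix, falling back to the filesystem's configured default rather than a hardcoded 3; it assumes a Hadoop Configuration in scope as hadoopConf, and the exact wiring into ClientBase is an assumption:
{code}
import org.apache.hadoop.fs.FileSystem

// Illustrative sketch only: prefer the HDFS default replication
// (dfs.replication) when the Spark setting is absent.
val fs = FileSystem.get(hadoopConf)
val replication = sparkConf.getInt("spark.yarn.submit.file.replication",
  fs.getDefaultReplication()).toShort
{code}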
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174165#comment-14174165 ] Marcelo Vanzin commented on SPARK-3979: --- BTW, this would avoid issues like this: {noformat}
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): file /user/systest/.sparkStaging/application_1413485082283_0001/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0.jar. Requested replication 3 exceeds maximum 1
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.verifyReplication(BlockManager.java:943)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplicationInt(FSNamesystem.java:2243)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplication(FSNamesystem.java:2233)
...
at org.apache.spark.deploy.yarn.ClientBase$class.copyFileToRemote(ClientBase.scala:101)
{noformat} Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-3979: -- Description: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). was: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {{noformat}} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {{noformat}} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174224#comment-14174224 ] Apache Spark commented on SPARK-3979: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2831 Yarn backend's default file replication should match HDFS's default one --- Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded 3 (which is the default value for HDFS), it should be using the default value from the HDFS conf (dfs.replication). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3980) GraphX Performance Issue
Jarred Li created SPARK-3980: Summary: GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3981) Consider a better approach to initialize SerDe on executors
Xiangrui Meng created SPARK-3981: Summary: Consider a better approach to initialize SerDe on executors Key: SPARK-3981 URL: https://issues.apache.org/jira/browse/SPARK-3981 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize MLlib types on executors as a hotfix. This is not ideal. We should find a way to add hooks to the SerDe in Core to support MLlib types in a pluggable way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
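One possible shape for such a hook, sketched here purely as an illustration of the idea (none of these names exist in Spark): core SerDe consults a registry of picklers that downstream modules like MLlib can extend.
{code}
// Hypothetical sketch of a pluggable SerDe hook in core: modules register
// picklers for their own types instead of core hardcoding (or copying) them.
trait TypePickler {
  def canHandle(obj: Any): Boolean
  def pickle(obj: Any): Array[Byte]
}

object PicklerRegistry {
  private val picklers = scala.collection.mutable.ArrayBuffer.empty[TypePickler]

  // Called once per JVM -- on the driver and on every executor -- so the
  // registration problem from SPARK-3971 does not recur.
  def register(p: TypePickler): Unit = synchronized { picklers += p }

  def pickle(obj: Any): Array[Byte] = synchronized {
    picklers.find(_.canHandle(obj)).map(_.pickle(obj))
      .getOrElse(sys.error(s"No pickler for ${obj.getClass}"))
  }
}
{code}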
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. (was: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed withon 7 hours.) GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3982) receiverStream in Python API
Davies Liu created SPARK-3982: - Summary: receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu receiverStream() is used to extend the input sources of streaming; it would be very useful to have it in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
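For reference, the existing Scala-side API that a Python receiverStream() would mirror looks roughly like this minimal custom receiver (details simplified; assumes an existing SparkContext `sc`):
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver that pushes one record per second.
class TickReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = new Thread("tick-receiver") {
    override def run(): Unit = {
      while (!isStopped()) {
        store("tick") // push one record into Spark
        Thread.sleep(1000)
      }
    }
  }.start()

  def onStop(): Unit = () // the thread above exits once isStopped() is true
}

val ssc = new StreamingContext(sc, Seconds(1))
val ticks = ssc.receiverStream(new TickReceiver)
{code}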
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. For a small dataset with 5,000,000 edges (http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip), the job can be completed within 16 seconds. was:I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from (http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For the PageRank algorithm, the job can not be completed within 7 hours. For a small dataset with 5,000,000 edges (http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip), the job can be completed within 16 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
Kay Ousterhout created SPARK-3983: - Summary: Scheduler delay (shown in the UI) is incorrect Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
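To illustrate the accounting problem described above (metric names here are descriptive, not the exact TaskMetrics fields): if delay is computed as all task time not spent running user code, the per-task overheads leak into it; subtracting the measured overheads attributes them correctly.
{code}
// Illustration only: the naive computation misattributes overheads
// to the scheduler.
def reportedSchedulerDelay(taskDuration: Long, executorRunTime: Long): Long =
  taskDuration - executorRunTime

// Subtracting deserialization and result-serialization time leaves only
// time actually spent waiting on scheduling.
def actualSchedulerDelay(taskDuration: Long, executorRunTime: Long,
                         taskDeserializationTime: Long,
                         resultSerializationTime: Long): Long =
  math.max(0L, taskDuration - executorRunTime -
    taskDeserializationTime - resultSerializationTime)
{code}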
[jira] [Issue Comment Deleted] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Comment: was deleted (was: This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration.) Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174452#comment-14174452 ] Kay Ousterhout commented on SPARK-3983: --- This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Description: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] was: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin edited comment on SPARK-3877 at 10/17/14 12:08 AM: -- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {{main()}} deals with errors, that still may not result in a non-zero exit status. was (Author: vanzin): [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts that run Spark jobs on YARN, because the failure cannot be detected in these scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin commented on SPARK-3877: --- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} finally {
  sc.stop()
}
{code} But this one wouldn't: {code}
val sc = ...
try {
  // do stuff
  if (somethingBad) throw MyJobFailedException()
} catch {
  case e: Exception => logError("Oops, something bad happened.", e)
} finally {
  sc.stop()
}
{code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts that run Spark jobs on YARN, because the failure cannot be detected in these scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
Kay Ousterhout created SPARK-3984: - Summary: Display finer grained metrics about task launch overhead in the UI Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174574#comment-14174574 ] Kay Ousterhout commented on SPARK-3983: --- https://github.com/apache/spark/pull/2832 Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Component/s: Web UI Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3984: -- Comment: was deleted (was: https://github.com/apache/spark/pull/2832) Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174576#comment-14174576 ] Apache Spark commented on SPARK-3984: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174577#comment-14174577 ] Apache Spark commented on SPARK-3983: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 Scheduler delay (shown in the UI) is incorrect -- Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174575#comment-14174575 ] Kay Ousterhout commented on SPARK-3984: --- https://github.com/apache/spark/pull/2832 Display finer grained metrics about task launch overhead in the UI -- Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3982) receiverStream in Python API
[ https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174580#comment-14174580 ] Apache Spark commented on SPARK-3982: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2833 receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Davies Liu receiverStream() is used to extend the input sources of streaming, it will be very useful to have it in Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3874) Provide stable TaskContext API
[ https://issues.apache.org/jira/browse/SPARK-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3874. Resolution: Fixed Fix Version/s: 1.2.0 Provide stable TaskContext API -- Key: SPARK-3874 URL: https://issues.apache.org/jira/browse/SPARK-3874 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.2.0 We made some improvements in SPARK-3543 but for Spark 1.2 we should convert TaskContext into a fully stable API. To do this I’d suggest the following changes - note that some of this reverses parts of SPARK-3543. The goal is to provide a class that users can’t easily construct and exposes only the public functionality.
1. Separate TaskContext into a public abstract class (TaskContext) and a private implementation called TaskContextImpl. The former should be a Java abstract class - the latter should be a private[spark] Scala class to reduce visibility (or maybe we can keep it as Java and tell people not to use it?).
2. The TaskContext abstract class will have (NOTE: this changes getXX() to XX() intentionally):
public isCompleted()
public isInterrupted()
public addTaskCompletionListener(...)
public addTaskCompletionCallback(...) (deprecated)
public stageId()
public partitionId()
public attemptId()
public isRunningLocally()
STATIC: public get(); set() and unset() at default visibility
3. A new private[spark] static object TaskContextHelper in the same package as TaskContext will exist to expose set() and unset() from within Spark using forwarder methods that just call TaskContext.set(). If someone within Spark wants to set this they call TaskContextHelper.set() and it forwards it.
4. TaskContextImpl will be used whenever we construct a TaskContext internally.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
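Rendered as a sketch, the proposed shape might look like the following (listener-registration methods omitted for brevity; everything here illustrates the proposal rather than the final code, and it assumes compilation inside package org.apache.spark so private[spark] applies):
{code}
package org.apache.spark

// Item 1/2: public abstract class exposing only the stable surface.
abstract class TaskContext {
  def isCompleted(): Boolean
  def isInterrupted(): Boolean
  def isRunningLocally(): Boolean
  def stageId(): Int
  def partitionId(): Int
  def attemptId(): Long
}

object TaskContext {
  private val taskContext = new ThreadLocal[TaskContext]
  def get(): TaskContext = taskContext.get()
  // Kept non-public, per item 2: reachable only from inside the package.
  private[spark] def set(tc: TaskContext): Unit = taskContext.set(tc)
  private[spark] def unset(): Unit = taskContext.remove()
}

// Items 1/4: the implementation stays out of the public API and is what
// Spark constructs internally.
private[spark] class TaskContextImpl(
    stage: Int, partition: Int, attempt: Long) extends TaskContext {
  @volatile private var completed = false
  private[spark] def markTaskCompleted(): Unit = { completed = true }
  def isCompleted(): Boolean = completed
  def isInterrupted(): Boolean = false
  def isRunningLocally(): Boolean = false
  def stageId(): Int = stage
  def partitionId(): Int = partition
  def attemptId(): Long = attempt
}

// Item 3: forwarders so internal callers can set/unset the thread-local
// without TaskContext exposing those methods publicly.
private[spark] object TaskContextHelper {
  def set(tc: TaskContext): Unit = TaskContext.set(tc)
  def unset(): Unit = TaskContext.unset()
}
{code}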
[jira] [Updated] (SPARK-3975) Block Matrix addition and multiplication
[ https://issues.apache.org/jira/browse/SPARK-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3975: --- Component/s: MLlib Block Matrix addition and multiplication Key: SPARK-3975 URL: https://issues.apache.org/jira/browse/SPARK-3975 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Block matrix addition and multiplication, for the case when partitioning schemes match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3974: -- Component/s: MLlib Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3976) Detect block matrix partitioning schemes
[ https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3976: -- Component/s: MLlib Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that non-identically partitioned matrices can be repartitioned before add/multiply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3977: -- Component/s: MLlib Conversions between {Row, Coordinate}Matrix <-> BlockMatrix --- Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174629#comment-14174629 ]

Patrick Wendell commented on SPARK-3882:

This is a known issue (SPARK-2316) that was fixed in Spark 1.1. To verify that you are hitting the same issue, would you mind testing your job with Spark 1.1 and seeing whether it still occurs?

JobProgressListener gets permanently out of sync with long running job
Key: SPARK-3882
URL: https://issues.apache.org/jira/browse/SPARK-3882
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png

A long running Spark context (non-streaming) will eventually start throwing the following in the driver:

{code}
java.util.NoSuchElementException: key not found: 12771
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
java.util.NoSuchElementException: key not found: 12782
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
{code}
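The trace shows the listener dying on a bare HashMap.apply for a stage ID it no longer tracks. Independent of the actual fix in SPARK-2316, the general defensive pattern is Option-based map access, so a missing key degrades to a skipped event instead of an exception that kills the listener. A minimal sketch of that pattern; stageIdToData and StageUIData here are illustrative stand-ins, not the real JobProgressListener fields:

{code}
import scala.collection.mutable

// Illustrative stand-in for the listener's per-stage state.
case class StageUIData(var completedTasks: Int = 0)

val stageIdToData = mutable.HashMap[Int, StageUIData]()

def onStageCompleted(stageId: Int): Unit = {
  // HashMap.apply throws NoSuchElementException for a missing key;
  // get returns an Option and lets the listener skip unknown stages.
  stageIdToData.get(stageId) match {
    case Some(data) => data.completedTasks += 1
    case None => println(s"Ignoring completion for untracked stage $stageId")
  }
}
{code}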
[jira] [Updated] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-3973:
Component/s: Spark Core

Print callSite information for broadcast variables
Key: SPARK-3973
URL: https://issues.apache.org/jira/browse/SPARK-3973
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
Fix For: 1.2.0

Printing call site information for broadcast variables will help in debugging which variables are used and when they are used.
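SparkContext already records a call site for RDDs; doing the same at broadcast creation ties each broadcast ID back to a user source line. A minimal sketch of deriving such a string from the current stack trace, assuming a hypothetical callSite helper (this is not Spark's internal Utils.getCallSite):

{code}
// Hedged sketch: build a short "file:line" call-site string from the
// current stack. The frame-skipping heuristic is illustrative only.
def callSite(skipFrames: Int = 2): String = {
  val frames = Thread.currentThread.getStackTrace
  frames.drop(skipFrames).headOption
    .map(f => s"${f.getFileName}:${f.getLineNumber}")
    .getOrElse("(unknown)")
}

// At broadcast creation time this could be logged alongside the id, e.g.
//   logInfo(s"Created broadcast $id at ${callSite()}")
// so the driver log shows which user line produced each broadcast.
{code}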
[jira] [Closed] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3923.
Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Aaron Davidson
Target Version/s: 1.2.0

All Standalone Mode services time out with each other
Key: SPARK-3923
URL: https://issues.apache.org/jira/browse/SPARK-3923
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker
Fix For: 1.2.0

I'm seeing an issue where components in Standalone Mode (Worker, Master, Driver, and Executor) all time out with each other after around 1000 seconds. Here is an example log:

{code}
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356-
... precisely 1000 seconds later ...
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918
14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
14/10/13 07:00:36 INFO Master: akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-
14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system
{code}
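The "precisely 1000 seconds" gap lines up with the Spark 1.x default of spark.akka.heartbeat.interval (1000 seconds), which feeds Akka's failure detector. As a diagnostic sketch only (these keys are documented Spark 1.x settings, but the values below are illustrative and this is not the fix that landed in 1.2.0):

{code}
import org.apache.spark.SparkConf

// Diagnostic sketch: adjust the Akka heartbeat/failure-detector knobs
// documented for Spark 1.x and see whether the 1000 s disassociations move.
// Values are illustrative, not recommendations.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "100") // seconds; 1.x default is 1000
  .set("spark.akka.heartbeat.pauses", "600")   // tolerated heartbeat pause, seconds
  .set("spark.akka.timeout", "100")            // seconds for Akka ask timeouts
{code}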
[jira] [Closed] (SPARK-3941) _remainingMem should not increase twice when updateBlockInfo
[ https://issues.apache.org/jira/browse/SPARK-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3941.
Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Zhang, Liye
Target Version/s: 1.2.0

_remainingMem should not increase twice when updateBlockInfo
Key: SPARK-3941
URL: https://issues.apache.org/jira/browse/SPARK-3941
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye
Assignee: Zhang, Liye
Fix For: 1.2.0

In BlockManagerMasterActor, _remainingMem is increased by memSize twice during updateBlockInfo when the new storage level is invalid.
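The bug class here is double-crediting in memory bookkeeping: if removing a block's old entry credits its memory back, and the invalid-storage-level path credits it again, _remainingMem grows by memSize twice. A minimal sketch of the single-credit invariant, with illustrative names rather than the actual BlockManagerMasterActor code:

{code}
import scala.collection.mutable

// Illustrative bookkeeping, not the real BlockManagerMasterActor.
var remainingMem: Long = 8L * 1024 * 1024 * 1024 // hypothetical 8 GB budget
val blocks = mutable.HashMap[String, Long]()     // blockId -> memSize held

def updateBlockInfo(blockId: String, levelIsValid: Boolean, memSize: Long): Unit = {
  // Credit the memory of any existing entry exactly once.
  blocks.remove(blockId).foreach(old => remainingMem += old)
  if (levelIsValid) {
    blocks(blockId) = memSize
    remainingMem -= memSize
  }
  // Invalid level: the block is simply dropped. Crediting memSize again
  // here is exactly the double-increment this issue fixes.
}
{code}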
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-3890:
Affects Version/s: 1.1.0 (was: 1.2.0)

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Closed] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3890.
Resolution: Fixed
Fix Version/s: 1.2.0, 1.1.1
Assignee: WangTaoTheTonic
Target Version/s: 1.1.1, 1.2.0

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-3890:
Affects Version/s: 1.2.0

remove redundant spark.executor.memory in doc
Key: SPARK-3890
URL: https://issues.apache.org/jira/browse/SPARK-3890
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Priority: Minor
Fix For: 1.1.1, 1.2.0

There appears to be a redundant spark.executor.memory config item in the docs.
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ]

Patrick Wendell commented on SPARK-3963:

In the initial version of this, I don't want to do either of those things.

Support getting task-scoped properties from TaskContext
Key: SPARK-3963
URL: https://issues.apache.org/jira/browse/SPARK-3963
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Wendell

This is a proposal for a minor feature. Given the stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer that users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. I'd imagine us using this for other things later on; we could also use it to expose some of the taskMetrics, such as the input bytes.

{code}
val data = sc.textFile("s3n://..2014/*/*/*.json")
data.mapPartitions { iter =>
  val tc = TaskContext.get
  val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
  val parts = fileName.split("/")
  val (year, month, day) = (parts(3), parts(4), parts(5))
  ...
}
{code}

Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
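For the internal half of the proposal, a sketch of what the task-scoped property map could look like; TaskContextImpl, setProperty, and HADOOP_FILE_NAME are names taken from or implied by the proposal above, not a shipped Spark API:

{code}
// Hedged sketch of the proposed task-scoped property map. Names follow
// the proposal text above; none of this is a shipped Spark API.
class TaskContextImpl {
  private val props = new java.util.HashMap[String, String]()

  // Internal-only in the initial version (private[spark] in Spark proper):
  // an RDD, e.g. a Hadoop-backed one, would call this before handing the
  // partition iterator to user code.
  def setProperty(key: String, value: String): Unit = {
    props.put(key, value)
  }

  // The user-facing read side proposed above.
  def getProperty(key: String): String = props.get(key)
}
{code}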