git commit: SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
Repository: spark Updated Branches: refs/heads/branch-1.0 f6323eb3b - 5c8e8de99 SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables. Author: Sandeep sand...@techaddict.me Closes #707 from techaddict/SPARK-1775 and squashes the following commits: 18d8ebf [Sandeep] SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables. (cherry picked from commit 7db47c463fefc244e9c100d4aab90451c3828261) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c8e8de9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c8e8de9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c8e8de9 Branch: refs/heads/branch-1.0 Commit: 5c8e8de99ffa5aadc1a130c9a3cbeb3c4936eb71 Parents: f6323eb Author: Sandeep sand...@techaddict.me Authored: Thu May 8 22:30:17 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Thu May 8 22:30:58 2014 -0700 -- .../org/apache/spark/scheduler/ShuffleMapTask.scala | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c8e8de9/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala index 4b0324f..9ba586f 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala @@ -57,15 +57,13 @@ private[spark] object ShuffleMapTask { } def deserializeInfo(stageId: Int, bytes: Array[Byte]): (RDD[_], ShuffleDependency[_,_]) = { -synchronized { - val loader = Thread.currentThread.getContextClassLoader - val in = new GZIPInputStream(new ByteArrayInputStream(bytes)) - val ser = SparkEnv.get.closureSerializer.newInstance() - val objIn = ser.deserializeStream(in) - val rdd = objIn.readObject().asInstanceOf[RDD[_]] - val dep = objIn.readObject().asInstanceOf[ShuffleDependency[_,_]] - (rdd, dep) -} +val loader = Thread.currentThread.getContextClassLoader +val in = new GZIPInputStream(new ByteArrayInputStream(bytes)) +val ser = SparkEnv.get.closureSerializer.newInstance() +val objIn = ser.deserializeStream(in) +val rdd = objIn.readObject().asInstanceOf[RDD[_]] +val dep = objIn.readObject().asInstanceOf[ShuffleDependency[_,_]] +(rdd, dep) } // Since both the JarSet and FileSet have the same format this is used for both.
git commit: Revert [SPARK-1784] Add a new partitioner to allow specifying # of keys per partition
Repository: spark Updated Branches: refs/heads/master c33b8dcbf - 7bb9a521f Revert [SPARK-1784] Add a new partitioner to allow specifying # of keys per partition This reverts commit 92cebada09a7e5a00ab48bcb350a9462949c33eb. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7bb9a521 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7bb9a521 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7bb9a521 Branch: refs/heads/master Commit: 7bb9a521f35eb19576c6cc2da3fd385910270e46 Parents: c33b8dc Author: Patrick Wendell pwend...@gmail.com Authored: Tue May 13 23:24:51 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Tue May 13 23:24:51 2014 -0700 -- .../scala/org/apache/spark/Partitioner.scala| 61 .../org/apache/spark/PartitioningSuite.scala| 34 --- 2 files changed, 95 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7bb9a521/core/src/main/scala/org/apache/spark/Partitioner.scala -- diff --git a/core/src/main/scala/org/apache/spark/Partitioner.scala b/core/src/main/scala/org/apache/spark/Partitioner.scala index 6274796..9155159 100644 --- a/core/src/main/scala/org/apache/spark/Partitioner.scala +++ b/core/src/main/scala/org/apache/spark/Partitioner.scala @@ -156,64 +156,3 @@ class RangePartitioner[K : Ordering : ClassTag, V]( false } } - -/** - * A [[org.apache.spark.Partitioner]] that partitions records into specified bounds - * Default value is 1000. Once all partitions have bounds elements, the partitioner - * allocates 1 element per partition so eventually the smaller partitions are at most - * off by 1 key compared to the larger partitions. - */ -class BoundaryPartitioner[K : Ordering : ClassTag, V]( -partitions: Int, -@transient rdd: RDD[_ : Product2[K,V]], -private val boundary: Int = 1000) - extends Partitioner { - - // this array keeps track of keys assigned to a partition - // counts[0] refers to # of keys in partition 0 and so on - private val counts: Array[Int] = { -new Array[Int](numPartitions) - } - - def numPartitions = math.abs(partitions) - - /* - * Ideally, this should've been calculated based on # partitions and total keys - * But we are not calling count on RDD here to avoid calling an action. - * User has the flexibility of calling count and passing in any appropriate boundary - */ - def keysPerPartition = boundary - - var currPartition = 0 - - /* - * Pick current partition for the key until we hit the bound for keys / partition, - * start allocating to next partition at that time. - * - * NOTE: In case where we have lets say 2000 keys and user says 3 partitions with 500 - * passed in as boundary, the first 500 will goto P1, 501-1000 go to P2, 1001-1500 go to P3, - * after that, next keys go to one partition at a time. So 1501 goes to P1, 1502 goes to P2, - * 1503 goes to P3 and so on. - */ - def getPartition(key: Any): Int = { -val partition = currPartition -counts(partition) = counts(partition) + 1 -/* -* Since we are filling up a partition before moving to next one (this helps in maintaining -* order of keys, in certain cases, it is possible to end up with empty partitions, like -* 3 partitions, 500 keys / partition and if rdd has 700 keys, 1 partition will be entirely -* empty. - */ -if(counts(currPartition) = keysPerPartition) - currPartition = (currPartition + 1) % numPartitions -partition - } - - override def equals(other: Any): Boolean = other match { -case r: BoundaryPartitioner[_,_] = - (r.counts.sameElements(counts) r.boundary == boundary - r.currPartition == currPartition) -case _ = - false - } -} http://git-wip-us.apache.org/repos/asf/spark/blob/7bb9a521/core/src/test/scala/org/apache/spark/PartitioningSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/PartitioningSuite.scala b/core/src/test/scala/org/apache/spark/PartitioningSuite.scala index 7d40395..7c30626 100644 --- a/core/src/test/scala/org/apache/spark/PartitioningSuite.scala +++ b/core/src/test/scala/org/apache/spark/PartitioningSuite.scala @@ -66,40 +66,6 @@ class PartitioningSuite extends FunSuite with SharedSparkContext with PrivateMet assert(descendingP4 != p4) } - test(BoundaryPartitioner equality) { -// Make an RDD where all the elements are the same so that the partition range bounds -// are deterministically all the same. -val rdd =
git commit: [SQL] Improve column pruning.
Repository: spark Updated Branches: refs/heads/master 7bb9a521f - 6ce088444 [SQL] Improve column pruning. Fixed a bug that was preventing us from ever pruning beneath Joins. ## TPC-DS Q3 ### Before: ``` Aggregate false, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand_id#64 AS brand_id#0,i_brand#65 AS brand#1,SUM(PartialSum#79) AS sum_agg#2] Exchange (HashPartitioning [d_year#12:0,i_brand#65:1,i_brand_id#64:2], 150) Aggregate true, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand#65,i_brand_id#64,SUM(CAST(ss_ext_sales_price#49, DoubleType)) AS PartialSum#79] Project [d_year#12:6,i_brand#65:59,i_brand_id#64:58,ss_ext_sales_price#49:43] HashJoin [ss_item_sk#36], [i_item_sk#57], BuildRight Exchange (HashPartitioning [ss_item_sk#36:30], 150) HashJoin [d_date_sk#6], [ss_sold_date_sk#34], BuildRight Exchange (HashPartitioning [d_date_sk#6:0], 150) Filter (d_moy#14:8 = 12) HiveTableScan [d_date_sk#6,d_date_id#7,d_date#8,d_month_seq#9,d_week_seq#10,d_quarter_seq#11,d_year#12,d_dow#13,d_moy#14,d_dom#15,d_qoy#16,d_fy_year#17,d_fy_quarter_seq#18,d_fy_week_seq#19,d_day_name#20,d_quarter_name#21,d_holiday#22,d_weekend#23,d_following_holiday#24,d_first_dom#25,d_last_dom#26,d_same_day_ly#27,d_same_day_lq#28,d_current_day#29,d_current_week#30,d_current_month#31,d_current_quarter#32,d_current_year#33], (MetastoreRelation default, date_dim, Some(dt)), None Exchange (HashPartitioning [ss_sold_date_sk#34:0], 150) HiveTableScan [ss_sold_date_sk#34,ss_sold_time_sk#35,ss_item_sk#36,ss_customer_sk#37,ss_cdemo_sk#38,ss_hdemo_sk#39,ss_addr_sk#40,ss_store_sk#41,ss_promo_sk#42,ss_ticket_number#43,ss_quantity#44,ss_wholesale_cost#45,ss_list_price#46,ss_sales_price#47,ss_ext_discount_amt#48,ss_ext_sales_price#49,ss_ext_wholesale_cost#50,ss_ext_list_price#51,ss_ext_tax#52,ss_coupon_amt#53,ss_net_paid#54,ss_net_paid_inc_tax#55,ss_net_profit#56], (MetastoreRelation default, store_sales, None), None Exchange (HashPartitioning [i_item_sk#57:0], 150) Filter (i_manufact_id#70:13 = 436) HiveTableScan [i_item_sk#57,i_item_id#58,i_rec_start_date#59,i_rec_end_date#60,i_item_desc#61,i_current_price#62,i_wholesale_cost#63,i_brand_id#64,i_brand#65,i_class_id#66,i_class#67,i_category_id#68,i_category#69,i_manufact_id#70,i_manufact#71,i_size#72,i_formulation#73,i_color#74,i_units#75,i_container#76,i_manager_id#77,i_product_name#78], (MetastoreRelation default, item, None), None ``` ### After ``` Aggregate false, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand_id#224 AS brand_id#160,i_brand#225 AS brand#161,SUM(PartialSum#239) AS sum_agg#162] Exchange (HashPartitioning [d_year#172:0,i_brand#225:1,i_brand_id#224:2], 150) Aggregate true, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand#225,i_brand_id#224,SUM(CAST(ss_ext_sales_price#209, DoubleType)) AS PartialSum#239] Project [d_year#172:1,i_brand#225:5,i_brand_id#224:3,ss_ext_sales_price#209:0] HashJoin [ss_item_sk#196], [i_item_sk#217], BuildRight Exchange (HashPartitioning [ss_item_sk#196:2], 150) Project [ss_ext_sales_price#209:2,d_year#172:1,ss_item_sk#196:3] HashJoin [d_date_sk#166], [ss_sold_date_sk#194], BuildRight Exchange (HashPartitioning [d_date_sk#166:0], 150) Project [d_date_sk#166:0,d_year#172:1] Filter (d_moy#174:2 = 12) HiveTableScan [d_date_sk#166,d_year#172,d_moy#174], (MetastoreRelation default, date_dim, Some(dt)), None Exchange (HashPartitioning [ss_sold_date_sk#194:2], 150) HiveTableScan [ss_ext_sales_price#209,ss_item_sk#196,ss_sold_date_sk#194], (MetastoreRelation default, store_sales, None), None Exchange (HashPartitioning [i_item_sk#217:1], 150) Project [i_brand_id#224:0,i_item_sk#217:1,i_brand#225:2] Filter (i_manufact_id#230:3 = 436) HiveTableScan [i_brand_id#224,i_item_sk#217,i_brand#225,i_manufact_id#230], (MetastoreRelation default, item, None), None ``` Author: Michael Armbrust mich...@databricks.com Closes #729 from marmbrus/fixPruning and squashes the following commits: 5feeff0 [Michael Armbrust] Improve column pruning. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ce08844 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ce08844 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ce08844 Branch: refs/heads/master Commit: 6ce0884446d3571fd6e9d967a080a59c657543b1 Parents: 7bb9a52 Author: Michael Armbrust mich...@databricks.com Authored: Tue May 13 23:27:22 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Tue May 13 23:27:22 2014 -0700 -- .../spark/sql/catalyst/optimizer/Optimizer.scala| 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) --
git commit: Revert [SPARK-1784] Add a new partitioner to allow specifying # of keys per partition
Repository: spark Updated Branches: refs/heads/branch-1.0 92b0ec9ac - 721194bda Revert [SPARK-1784] Add a new partitioner to allow specifying # of keys per partition This reverts commit 66fe4797a845bb1a2728dcdb2d7371f0e90da867. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/721194bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/721194bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/721194bd Branch: refs/heads/branch-1.0 Commit: 721194bdaa54429e76bb8b527154cdfd9c9d0e37 Parents: 92b0ec9 Author: Patrick Wendell pwend...@gmail.com Authored: Tue May 13 23:25:19 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Tue May 13 23:25:19 2014 -0700 -- .../scala/org/apache/spark/Partitioner.scala| 61 .../org/apache/spark/PartitioningSuite.scala| 34 --- 2 files changed, 95 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/721194bd/core/src/main/scala/org/apache/spark/Partitioner.scala -- diff --git a/core/src/main/scala/org/apache/spark/Partitioner.scala b/core/src/main/scala/org/apache/spark/Partitioner.scala index 6274796..9155159 100644 --- a/core/src/main/scala/org/apache/spark/Partitioner.scala +++ b/core/src/main/scala/org/apache/spark/Partitioner.scala @@ -156,64 +156,3 @@ class RangePartitioner[K : Ordering : ClassTag, V]( false } } - -/** - * A [[org.apache.spark.Partitioner]] that partitions records into specified bounds - * Default value is 1000. Once all partitions have bounds elements, the partitioner - * allocates 1 element per partition so eventually the smaller partitions are at most - * off by 1 key compared to the larger partitions. - */ -class BoundaryPartitioner[K : Ordering : ClassTag, V]( -partitions: Int, -@transient rdd: RDD[_ : Product2[K,V]], -private val boundary: Int = 1000) - extends Partitioner { - - // this array keeps track of keys assigned to a partition - // counts[0] refers to # of keys in partition 0 and so on - private val counts: Array[Int] = { -new Array[Int](numPartitions) - } - - def numPartitions = math.abs(partitions) - - /* - * Ideally, this should've been calculated based on # partitions and total keys - * But we are not calling count on RDD here to avoid calling an action. - * User has the flexibility of calling count and passing in any appropriate boundary - */ - def keysPerPartition = boundary - - var currPartition = 0 - - /* - * Pick current partition for the key until we hit the bound for keys / partition, - * start allocating to next partition at that time. - * - * NOTE: In case where we have lets say 2000 keys and user says 3 partitions with 500 - * passed in as boundary, the first 500 will goto P1, 501-1000 go to P2, 1001-1500 go to P3, - * after that, next keys go to one partition at a time. So 1501 goes to P1, 1502 goes to P2, - * 1503 goes to P3 and so on. - */ - def getPartition(key: Any): Int = { -val partition = currPartition -counts(partition) = counts(partition) + 1 -/* -* Since we are filling up a partition before moving to next one (this helps in maintaining -* order of keys, in certain cases, it is possible to end up with empty partitions, like -* 3 partitions, 500 keys / partition and if rdd has 700 keys, 1 partition will be entirely -* empty. - */ -if(counts(currPartition) = keysPerPartition) - currPartition = (currPartition + 1) % numPartitions -partition - } - - override def equals(other: Any): Boolean = other match { -case r: BoundaryPartitioner[_,_] = - (r.counts.sameElements(counts) r.boundary == boundary - r.currPartition == currPartition) -case _ = - false - } -} http://git-wip-us.apache.org/repos/asf/spark/blob/721194bd/core/src/test/scala/org/apache/spark/PartitioningSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/PartitioningSuite.scala b/core/src/test/scala/org/apache/spark/PartitioningSuite.scala index 7d40395..7c30626 100644 --- a/core/src/test/scala/org/apache/spark/PartitioningSuite.scala +++ b/core/src/test/scala/org/apache/spark/PartitioningSuite.scala @@ -66,40 +66,6 @@ class PartitioningSuite extends FunSuite with SharedSparkContext with PrivateMet assert(descendingP4 != p4) } - test(BoundaryPartitioner equality) { -// Make an RDD where all the elements are the same so that the partition range bounds -// are deterministically all the same. -val rdd =
git commit: [SQL] Improve column pruning.
Repository: spark Updated Branches: refs/heads/branch-1.0 721194bda - f66f76648 [SQL] Improve column pruning. Fixed a bug that was preventing us from ever pruning beneath Joins. ## TPC-DS Q3 ### Before: ``` Aggregate false, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand_id#64 AS brand_id#0,i_brand#65 AS brand#1,SUM(PartialSum#79) AS sum_agg#2] Exchange (HashPartitioning [d_year#12:0,i_brand#65:1,i_brand_id#64:2], 150) Aggregate true, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand#65,i_brand_id#64,SUM(CAST(ss_ext_sales_price#49, DoubleType)) AS PartialSum#79] Project [d_year#12:6,i_brand#65:59,i_brand_id#64:58,ss_ext_sales_price#49:43] HashJoin [ss_item_sk#36], [i_item_sk#57], BuildRight Exchange (HashPartitioning [ss_item_sk#36:30], 150) HashJoin [d_date_sk#6], [ss_sold_date_sk#34], BuildRight Exchange (HashPartitioning [d_date_sk#6:0], 150) Filter (d_moy#14:8 = 12) HiveTableScan [d_date_sk#6,d_date_id#7,d_date#8,d_month_seq#9,d_week_seq#10,d_quarter_seq#11,d_year#12,d_dow#13,d_moy#14,d_dom#15,d_qoy#16,d_fy_year#17,d_fy_quarter_seq#18,d_fy_week_seq#19,d_day_name#20,d_quarter_name#21,d_holiday#22,d_weekend#23,d_following_holiday#24,d_first_dom#25,d_last_dom#26,d_same_day_ly#27,d_same_day_lq#28,d_current_day#29,d_current_week#30,d_current_month#31,d_current_quarter#32,d_current_year#33], (MetastoreRelation default, date_dim, Some(dt)), None Exchange (HashPartitioning [ss_sold_date_sk#34:0], 150) HiveTableScan [ss_sold_date_sk#34,ss_sold_time_sk#35,ss_item_sk#36,ss_customer_sk#37,ss_cdemo_sk#38,ss_hdemo_sk#39,ss_addr_sk#40,ss_store_sk#41,ss_promo_sk#42,ss_ticket_number#43,ss_quantity#44,ss_wholesale_cost#45,ss_list_price#46,ss_sales_price#47,ss_ext_discount_amt#48,ss_ext_sales_price#49,ss_ext_wholesale_cost#50,ss_ext_list_price#51,ss_ext_tax#52,ss_coupon_amt#53,ss_net_paid#54,ss_net_paid_inc_tax#55,ss_net_profit#56], (MetastoreRelation default, store_sales, None), None Exchange (HashPartitioning [i_item_sk#57:0], 150) Filter (i_manufact_id#70:13 = 436) HiveTableScan [i_item_sk#57,i_item_id#58,i_rec_start_date#59,i_rec_end_date#60,i_item_desc#61,i_current_price#62,i_wholesale_cost#63,i_brand_id#64,i_brand#65,i_class_id#66,i_class#67,i_category_id#68,i_category#69,i_manufact_id#70,i_manufact#71,i_size#72,i_formulation#73,i_color#74,i_units#75,i_container#76,i_manager_id#77,i_product_name#78], (MetastoreRelation default, item, None), None ``` ### After ``` Aggregate false, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand_id#224 AS brand_id#160,i_brand#225 AS brand#161,SUM(PartialSum#239) AS sum_agg#162] Exchange (HashPartitioning [d_year#172:0,i_brand#225:1,i_brand_id#224:2], 150) Aggregate true, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand#225,i_brand_id#224,SUM(CAST(ss_ext_sales_price#209, DoubleType)) AS PartialSum#239] Project [d_year#172:1,i_brand#225:5,i_brand_id#224:3,ss_ext_sales_price#209:0] HashJoin [ss_item_sk#196], [i_item_sk#217], BuildRight Exchange (HashPartitioning [ss_item_sk#196:2], 150) Project [ss_ext_sales_price#209:2,d_year#172:1,ss_item_sk#196:3] HashJoin [d_date_sk#166], [ss_sold_date_sk#194], BuildRight Exchange (HashPartitioning [d_date_sk#166:0], 150) Project [d_date_sk#166:0,d_year#172:1] Filter (d_moy#174:2 = 12) HiveTableScan [d_date_sk#166,d_year#172,d_moy#174], (MetastoreRelation default, date_dim, Some(dt)), None Exchange (HashPartitioning [ss_sold_date_sk#194:2], 150) HiveTableScan [ss_ext_sales_price#209,ss_item_sk#196,ss_sold_date_sk#194], (MetastoreRelation default, store_sales, None), None Exchange (HashPartitioning [i_item_sk#217:1], 150) Project [i_brand_id#224:0,i_item_sk#217:1,i_brand#225:2] Filter (i_manufact_id#230:3 = 436) HiveTableScan [i_brand_id#224,i_item_sk#217,i_brand#225,i_manufact_id#230], (MetastoreRelation default, item, None), None ``` Author: Michael Armbrust mich...@databricks.com Closes #729 from marmbrus/fixPruning and squashes the following commits: 5feeff0 [Michael Armbrust] Improve column pruning. (cherry picked from commit 6ce0884446d3571fd6e9d967a080a59c657543b1) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f66f7664 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f66f7664 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f66f7664 Branch: refs/heads/branch-1.0 Commit: f66f76648d32f2ca274b623db395df8a9c6e7d64 Parents: 721194b Author: Michael Armbrust mich...@databricks.com Authored: Tue May 13 23:27:22 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Tue May 13 23:27:29 2014 -0700 -- .../spark/sql/catalyst/optimizer/Optimizer.scala| 16 +++- 1 file
git commit: SPARK-1801. expose InterruptibleIterator and TaskKilledException in deve...
Repository: spark Updated Branches: refs/heads/master 6ce088444 - b22952fa1 SPARK-1801. expose InterruptibleIterator and TaskKilledException in deve... ...loper api Author: Koert Kuipers ko...@tresata.com Closes #764 from koertkuipers/feat-rdd-developerapi and squashes the following commits: 8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b22952fa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b22952fa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b22952fa Branch: refs/heads/master Commit: b22952fa1f21c0b93208846b5e1941a9d2578c6f Parents: 6ce0884 Author: Koert Kuipers ko...@tresata.com Authored: Wed May 14 00:10:12 2014 -0700 Committer: Aaron Davidson aa...@databricks.com Committed: Wed May 14 00:12:35 2014 -0700 -- .../main/scala/org/apache/spark/InterruptibleIterator.scala | 6 +- .../main/scala/org/apache/spark/TaskKilledException.scala| 8 ++-- 2 files changed, 11 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b22952fa/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala -- diff --git a/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala b/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala index ec11dbb..f40baa8 100644 --- a/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala +++ b/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala @@ -17,11 +17,15 @@ package org.apache.spark +import org.apache.spark.annotation.DeveloperApi + /** + * :: DeveloperApi :: * An iterator that wraps around an existing iterator to provide task killing functionality. * It works by checking the interrupted flag in [[TaskContext]]. */ -private[spark] class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T]) +@DeveloperApi +class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T]) extends Iterator[T] { def hasNext: Boolean = { http://git-wip-us.apache.org/repos/asf/spark/blob/b22952fa/core/src/main/scala/org/apache/spark/TaskKilledException.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskKilledException.scala b/core/src/main/scala/org/apache/spark/TaskKilledException.scala index cbd6b28..ad487c4 100644 --- a/core/src/main/scala/org/apache/spark/TaskKilledException.scala +++ b/core/src/main/scala/org/apache/spark/TaskKilledException.scala @@ -17,7 +17,11 @@ package org.apache.spark +import org.apache.spark.annotation.DeveloperApi + /** - * Exception for a task getting killed. + * :: DeveloperApi :: + * Exception thrown when a task is explicitly killed (i.e., task failure is expected). */ -private[spark] class TaskKilledException extends RuntimeException +@DeveloperApi +class TaskKilledException extends RuntimeException
git commit: SPARK-1801. expose InterruptibleIterator and TaskKilledException in deve...
Repository: spark Updated Branches: refs/heads/branch-1.0 f66f76648 - 7da80a318 SPARK-1801. expose InterruptibleIterator and TaskKilledException in deve... ...loper api Author: Koert Kuipers ko...@tresata.com Closes #764 from koertkuipers/feat-rdd-developerapi and squashes the following commits: 8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api (cherry picked from commit b22952fa1f21c0b93208846b5e1941a9d2578c6f) Signed-off-by: Aaron Davidson aa...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7da80a31 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7da80a31 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7da80a31 Branch: refs/heads/branch-1.0 Commit: 7da80a3186e9120c26ed88dc1211356a1d5eb8af Parents: f66f766 Author: Koert Kuipers ko...@tresata.com Authored: Wed May 14 00:10:12 2014 -0700 Committer: Aaron Davidson aa...@databricks.com Committed: Wed May 14 00:12:59 2014 -0700 -- .../main/scala/org/apache/spark/InterruptibleIterator.scala | 6 +- .../main/scala/org/apache/spark/TaskKilledException.scala| 8 ++-- 2 files changed, 11 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7da80a31/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala -- diff --git a/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala b/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala index ec11dbb..f40baa8 100644 --- a/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala +++ b/core/src/main/scala/org/apache/spark/InterruptibleIterator.scala @@ -17,11 +17,15 @@ package org.apache.spark +import org.apache.spark.annotation.DeveloperApi + /** + * :: DeveloperApi :: * An iterator that wraps around an existing iterator to provide task killing functionality. * It works by checking the interrupted flag in [[TaskContext]]. */ -private[spark] class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T]) +@DeveloperApi +class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T]) extends Iterator[T] { def hasNext: Boolean = { http://git-wip-us.apache.org/repos/asf/spark/blob/7da80a31/core/src/main/scala/org/apache/spark/TaskKilledException.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskKilledException.scala b/core/src/main/scala/org/apache/spark/TaskKilledException.scala index cbd6b28..ad487c4 100644 --- a/core/src/main/scala/org/apache/spark/TaskKilledException.scala +++ b/core/src/main/scala/org/apache/spark/TaskKilledException.scala @@ -17,7 +17,11 @@ package org.apache.spark +import org.apache.spark.annotation.DeveloperApi + /** - * Exception for a task getting killed. + * :: DeveloperApi :: + * Exception thrown when a task is explicitly killed (i.e., task failure is expected). */ -private[spark] class TaskKilledException extends RuntimeException +@DeveloperApi +class TaskKilledException extends RuntimeException
git commit: Fix dep exclusion: avro-ipc, not avro, depends on netty.
Repository: spark Updated Branches: refs/heads/master b22952fa1 - 54ae8328b Fix dep exclusion: avro-ipc, not avro, depends on netty. Author: Marcelo Vanzin van...@cloudera.com Closes #763 from vanzin/netty-dep-hell and squashes the following commits: dfb6ce2 [Marcelo Vanzin] Fix dep exclusion: avro-ipc, not avro, depends on netty. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54ae8328 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54ae8328 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54ae8328 Branch: refs/heads/master Commit: 54ae8328bd7d052ba347768cfb02cb5dfdd8045e Parents: b22952f Author: Marcelo Vanzin van...@cloudera.com Authored: Wed May 14 00:37:57 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 00:37:57 2014 -0700 -- pom.xml | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/54ae8328/pom.xml -- diff --git a/pom.xml b/pom.xml index 4d4c5f6..786b6d4 100644 --- a/pom.xml +++ b/pom.xml @@ -496,12 +496,6 @@ groupIdorg.apache.avro/groupId artifactIdavro/artifactId version${avro.version}/version -exclusions - exclusion -groupIdio.netty/groupId -artifactIdnetty/artifactId - /exclusion -/exclusions /dependency dependency groupIdorg.apache.avro/groupId @@ -509,6 +503,10 @@ version${avro.version}/version exclusions exclusion +groupIdio.netty/groupId +artifactIdnetty/artifactId + /exclusion + exclusion groupIdorg.mortbay.jetty/groupId artifactIdjetty/artifactId /exclusion
git commit: [SPARK-1769] Executor loss causes NPE race condition
Repository: spark Updated Branches: refs/heads/branch-1.0 b3d987893 - 69ec3149f [SPARK-1769] Executor loss causes NPE race condition This PR replaces the Schedulable data structures in Pool.scala with thread-safe ones from java. Note that Scala's `with SynchronizedBuffer` trait is soon to be deprecated in 2.11 because it is [inherently unreliable](http://www.scala-lang.org/api/2.11.0/index.html#scala.collection.mutable.SynchronizedBuffer). We should slowly drift away from `SynchronizedBuffer` in other places too. Note that this PR introduces an API-breaking change; `sc.getAllPools` now returns an Array rather than an ArrayBuffer. This is because we want this method to return an immutable copy rather than one may potentially confuse the user if they try to modify the copy, which takes no effect on the original data structure. Author: Andrew Or andrewo...@gmail.com Closes #762 from andrewor14/pool-npe and squashes the following commits: 383e739 [Andrew Or] JavaConverters - JavaConversions 3f32981 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe 769be19 [Andrew Or] Assorted minor changes 2189247 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe 05ad9e9 [Andrew Or] Fix test - contains is not the same as containsKey 0921ea0 [Andrew Or] var - val 07d720c [Andrew Or] Synchronize Schedulable data structures (cherry picked from commit 69f750228f3ec8537a93da08e712596fa8004143) Signed-off-by: Aaron Davidson aa...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69ec3149 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69ec3149 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69ec3149 Branch: refs/heads/branch-1.0 Commit: 69ec3149fb4d732935748b9afee4f9d8a7b1244e Parents: b3d9878 Author: Andrew Or andrewo...@gmail.com Authored: Wed May 14 00:54:33 2014 -0700 Committer: Aaron Davidson aa...@databricks.com Committed: Wed May 14 00:54:49 2014 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 20 - .../scala/org/apache/spark/scheduler/Pool.scala | 31 ++-- .../apache/spark/scheduler/Schedulable.scala| 6 ++-- .../spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../scheduler/TaskSchedulerImplSuite.scala | 2 +- 5 files changed, 35 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69ec3149/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index c43b4fd..032b3d7 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -17,15 +17,17 @@ package org.apache.spark +import scala.language.implicitConversions + import java.io._ import java.net.URI import java.util.concurrent.atomic.AtomicInteger import java.util.{Properties, UUID} import java.util.UUID.randomUUID import scala.collection.{Map, Set} +import scala.collection.JavaConversions._ import scala.collection.generic.Growable -import scala.collection.mutable.{ArrayBuffer, HashMap} -import scala.language.implicitConversions +import scala.collection.mutable.HashMap import scala.reflect.{ClassTag, classTag} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.Path @@ -836,18 +838,22 @@ class SparkContext(config: SparkConf) extends Logging { } /** - * Return pools for fair scheduler - * TODO(xiajunluan): We should take nested pools into account + * :: DeveloperApi :: + * Return pools for fair scheduler */ - def getAllPools: ArrayBuffer[Schedulable] = { -taskScheduler.rootPool.schedulableQueue + @DeveloperApi + def getAllPools: Seq[Schedulable] = { +// TODO(xiajunluan): We should take nested pools into account +taskScheduler.rootPool.schedulableQueue.toSeq } /** + * :: DeveloperApi :: * Return the pool associated with the given name, if one exists */ + @DeveloperApi def getPoolForName(pool: String): Option[Schedulable] = { -taskScheduler.rootPool.schedulableNameToSchedulable.get(pool) +Option(taskScheduler.rootPool.schedulableNameToSchedulable.get(pool)) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/69ec3149/core/src/main/scala/org/apache/spark/scheduler/Pool.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/Pool.scala b/core/src/main/scala/org/apache/spark/scheduler/Pool.scala index 187672c..174b732 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/Pool.scala +++
git commit: Fixed streaming examples docs to use run-example instead of spark-submit
Repository: spark Updated Branches: refs/heads/branch-1.0 69ec3149f - c7571d8c6 Fixed streaming examples docs to use run-example instead of spark-submit Pretty self-explanatory Author: Tathagata Das tathagata.das1...@gmail.com Closes #722 from tdas/example-fix and squashes the following commits: 7839979 [Tathagata Das] Minor changes. 0673441 [Tathagata Das] Fixed java docs of java streaming example e687123 [Tathagata Das] Fixed scala style errors. 9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c7571d8c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c7571d8c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c7571d8c Branch: refs/heads/branch-1.0 Commit: c7571d8c6ba058b67cca2b910fd0efacc06642cd Parents: 69ec314 Author: Tathagata Das tathagata.das1...@gmail.com Authored: Wed May 14 04:17:32 2014 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Wed May 14 04:24:48 2014 -0700 -- .../examples/streaming/JavaCustomReceiver.java | 13 ++--- .../examples/streaming/JavaFlumeEventCount.java | 6 +- .../examples/streaming/JavaKafkaWordCount.java | 6 +- .../streaming/JavaNetworkWordCount.java | 13 +++-- .../examples/streaming/ActorWordCount.scala | 6 +- .../examples/streaming/CustomReceiver.scala | 19 +++ .../examples/streaming/FlumeEventCount.scala| 9 ++- .../examples/streaming/HdfsWordCount.scala | 5 +- .../examples/streaming/KafkaWordCount.scala | 6 +- .../examples/streaming/MQTTWordCount.scala | 10 ++-- .../examples/streaming/NetworkWordCount.scala | 14 +++-- .../streaming/RecoverableNetworkWordCount.scala | 7 +-- .../streaming/StatefulNetworkWordCount.scala| 6 +- .../examples/streaming/TwitterPopularTags.scala | 22 +++- .../examples/streaming/ZeroMQWordCount.scala| 8 +-- .../clickstream/PageViewGenerator.scala | 10 ++-- .../streaming/clickstream/PageViewStream.scala | 7 ++- .../streaming/twitter/TwitterInputDStream.scala | 58 18 files changed, 130 insertions(+), 95 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c7571d8c/examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java b/examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java index 7f558f3..5622df5 100644 --- a/examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java +++ b/examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java @@ -19,6 +19,7 @@ package org.apache.spark.examples.streaming; import com.google.common.collect.Lists; +import org.apache.spark.SparkConf; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function2; import org.apache.spark.api.java.function.PairFunction; @@ -48,25 +49,23 @@ import java.util.regex.Pattern; * To run this on your local machine, you need to first run a Netcat server *`$ nc -lk ` * and then run the example - *`$ ./run org.apache.spark.examples.streaming.JavaCustomReceiver local[2] localhost ` + *`$ bin/run-example org.apache.spark.examples.streaming.JavaCustomReceiver localhost ` */ public class JavaCustomReceiver extends ReceiverString { private static final Pattern SPACE = Pattern.compile( ); public static void main(String[] args) { -if (args.length 3) { - System.err.println(Usage: JavaNetworkWordCount master hostname port\n + - In local mode, master should be 'local[n]' with n 1); +if (args.length 2) { + System.err.println(Usage: JavaNetworkWordCount hostname port); System.exit(1); } StreamingExamples.setStreamingLogLevels(); // Create the context with a 1 second batch size -JavaStreamingContext ssc = new JavaStreamingContext(args[0], JavaNetworkWordCount, -new Duration(1000), System.getenv(SPARK_HOME), -JavaStreamingContext.jarOfClass(JavaNetworkWordCount.class)); +SparkConf sparkConf = new SparkConf().setAppName(JavaCustomReceiver); +JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000)); // Create a input stream with the custom receiver on target ip:port and count the // words in input stream of \n delimited text (eg. generated by 'nc') http://git-wip-us.apache.org/repos/asf/spark/blob/c7571d8c/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
git commit: SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info
Repository: spark Updated Branches: refs/heads/branch-1.0 c7571d8c6 - 7083282ea SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info LICENSE and NOTICE policy is explained here: http://www.apache.org/dev/licensing-howto.html http://www.apache.org/legal/3party.html This leads to the following changes. First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt. This generates a consolidated NOTICE file that I manually added to NOTICE. Next, a list of all dependencies and their licenses was generated: `mvn ... license:aggregate-add-third-party` to create: `target/generated-sources/license/THIRD-PARTY.txt` Each dependency is listed with one or more licenses. Determine the most-compatible license for each if there is more than one. For unknown license dependencies, I manually evaluateD their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license. I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above). BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section without output from the THIRD-PARTY.txt file appropriately. Everything else, Category B licenses, are evidently mentioned in NOTICE (?) Same there. LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it. Author: Sean Owen so...@cloudera.com Closes #770 from srowen/SPARK-1827 and squashes the following commits: a764504 [Sean Owen] Add LICENSE and NOTICE info for all transitive dependencies as of 1.0 (cherry picked from commit 2e5a7cde223c8bf6d34e46b27ac94a965441584d) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7083282e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7083282e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7083282e Branch: refs/heads/branch-1.0 Commit: 7083282eaea9a1256b1047c0be9c07dbaba175ce Parents: c7571d8 Author: Sean Owen so...@cloudera.com Authored: Wed May 14 09:38:33 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 09:38:46 2014 -0700 -- LICENSE | 103 + NOTICE | 572 +- assembly/pom.xml | 2 + 3 files changed, 671 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7083282e/LICENSE -- diff --git a/LICENSE b/LICENSE index 1c1c2c0..383f079 100644 --- a/LICENSE +++ b/LICENSE @@ -428,3 +428,106 @@ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON A THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +For colt: + + +Copyright (c) 1999 CERN - European Organization for Nuclear Research. +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. CERN makes no representations about the suitability of this software for any purpose. It is provided as is without expressed or implied warranty. + +Packages hep.aida.* + +Written by Pavel Binko, Dino Ferrero Merlino, Wolfgang Hoschek, Tony Johnson, Andreas Pfeiffer, and others. Check the FreeHEP home page for more info. Permission to use and/or redistribute this work is granted under the terms of the LGPL License, with the exception that any usage related to military applications is expressly forbidden. The software and documentation made available under the terms of this license are provided with no warranty. + + + +Fo SnapTree: + + +SNAPTREE LICENSE + +Copyright (c) 2009-2012 Stanford University, unless otherwise specified. +All rights reserved. + +This software was developed by the Pervasive Parallelism Laboratory of +Stanford University, California, USA. + +Permission to use, copy, modify, and distribute this software in
git commit: SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info
Repository: spark Updated Branches: refs/heads/master 68f28dabe - 2e5a7cde2 SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info LICENSE and NOTICE policy is explained here: http://www.apache.org/dev/licensing-howto.html http://www.apache.org/legal/3party.html This leads to the following changes. First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt. This generates a consolidated NOTICE file that I manually added to NOTICE. Next, a list of all dependencies and their licenses was generated: `mvn ... license:aggregate-add-third-party` to create: `target/generated-sources/license/THIRD-PARTY.txt` Each dependency is listed with one or more licenses. Determine the most-compatible license for each if there is more than one. For unknown license dependencies, I manually evaluateD their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license. I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above). BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section without output from the THIRD-PARTY.txt file appropriately. Everything else, Category B licenses, are evidently mentioned in NOTICE (?) Same there. LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it. Author: Sean Owen so...@cloudera.com Closes #770 from srowen/SPARK-1827 and squashes the following commits: a764504 [Sean Owen] Add LICENSE and NOTICE info for all transitive dependencies as of 1.0 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2e5a7cde Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2e5a7cde Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2e5a7cde Branch: refs/heads/master Commit: 2e5a7cde223c8bf6d34e46b27ac94a965441584d Parents: 68f28da Author: Sean Owen so...@cloudera.com Authored: Wed May 14 09:38:33 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 09:38:33 2014 -0700 -- LICENSE | 103 + NOTICE | 572 +- assembly/pom.xml | 2 + 3 files changed, 671 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2e5a7cde/LICENSE -- diff --git a/LICENSE b/LICENSE index 1c1c2c0..383f079 100644 --- a/LICENSE +++ b/LICENSE @@ -428,3 +428,106 @@ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON A THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +For colt: + + +Copyright (c) 1999 CERN - European Organization for Nuclear Research. +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. CERN makes no representations about the suitability of this software for any purpose. It is provided as is without expressed or implied warranty. + +Packages hep.aida.* + +Written by Pavel Binko, Dino Ferrero Merlino, Wolfgang Hoschek, Tony Johnson, Andreas Pfeiffer, and others. Check the FreeHEP home page for more info. Permission to use and/or redistribute this work is granted under the terms of the LGPL License, with the exception that any usage related to military applications is expressly forbidden. The software and documentation made available under the terms of this license are provided with no warranty. + + + +Fo SnapTree: + + +SNAPTREE LICENSE + +Copyright (c) 2009-2012 Stanford University, unless otherwise specified. +All rights reserved. + +This software was developed by the Pervasive Parallelism Laboratory of +Stanford University, California, USA. + +Permission to use, copy, modify, and distribute this software in source +or binary form for any purpose with or without fee is hereby granted, +provided that the following conditions are met: + +
git commit: SPARK-1818 Freshen Mesos documentation
Repository: spark Updated Branches: refs/heads/master 2e5a7cde2 - d1d41ccee SPARK-1818 Freshen Mesos documentation Place more emphasis on using precompiled binary versions of Spark and Mesos instead of encouraging the reader to compile from source. Author: Andrew Ash and...@andrewash.com Closes #756 from ash211/spark-1818 and squashes the following commits: 7ef3b33 [Andrew Ash] Brief explanation of the interactions between Spark and Mesos e7dea8e [Andrew Ash] Add troubleshooting and debugging section 956362d [Andrew Ash] Don't need to pass spark.executor.uri into the spark shell de3353b [Andrew Ash] Wrap to 100char 7ebf6ef [Andrew Ash] Polish on the section on Mesos Master URLs 3dcc2c1 [Andrew Ash] Use --tgz parameter of make-distribution 41b68ed [Andrew Ash] Period at end of sentence; formatting on :5050 8bf2c53 [Andrew Ash] Update site.MESOS_VERSIOn to match /pom.xml 74f2040 [Andrew Ash] SPARK-1818 Freshen Mesos documentation Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d1d41cce Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d1d41cce Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d1d41cce Branch: refs/heads/master Commit: d1d41ccee49a5c093cb61c791c01f64f2076b83e Parents: 2e5a7cd Author: Andrew Ash and...@andrewash.com Authored: Wed May 14 09:45:33 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 09:46:20 2014 -0700 -- docs/_config.yml | 2 +- docs/running-on-mesos.md | 200 -- 2 files changed, 174 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d1d41cce/docs/_config.yml -- diff --git a/docs/_config.yml b/docs/_config.yml index d177e38..45b78fe 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -7,6 +7,6 @@ SPARK_VERSION: 1.0.0-SNAPSHOT SPARK_VERSION_SHORT: 1.0.0 SCALA_BINARY_VERSION: 2.10 SCALA_VERSION: 2.10.4 -MESOS_VERSION: 0.13.0 +MESOS_VERSION: 0.18.1 SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK SPARK_GITHUB_URL: https://github.com/apache/spark http://git-wip-us.apache.org/repos/asf/spark/blob/d1d41cce/docs/running-on-mesos.md -- diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md index 68259f0..ef762aa 100644 --- a/docs/running-on-mesos.md +++ b/docs/running-on-mesos.md @@ -3,19 +3,123 @@ layout: global title: Running Spark on Mesos --- -Spark can run on clusters managed by [Apache Mesos](http://mesos.apache.org/). Follow the steps below to install Mesos and Spark: - -1. Download and build Spark using the instructions [here](index.html). **Note:** Don't forget to consider what version of HDFS you might want to use! -2. Download, build, install, and start Mesos {{site.MESOS_VERSION}} on your cluster. You can download the Mesos distribution from a [mirror](http://www.apache.org/dyn/closer.cgi/mesos/{{site.MESOS_VERSION}}/). See the Mesos [Getting Started](http://mesos.apache.org/gettingstarted) page for more information. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g., if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`. -3. Create a Spark distribution using `make-distribution.sh`. -4. Rename the `dist` directory created from `make-distribution.sh` to `spark-{{site.SPARK_VERSION}}`. -5. Create a `tar` archive: `tar czf spark-{{site.SPARK_VERSION}}.tar.gz spark-{{site.SPARK_VERSION}}` -6. Upload this archive to HDFS or another place accessible from Mesos via `http://`, e.g., [Amazon Simple Storage Service](http://aws.amazon.com/s3): `hadoop fs -put spark-{{site.SPARK_VERSION}}.tar.gz /path/to/spark-{{site.SPARK_VERSION}}.tar.gz` -7. Create a file called `spark-env.sh` in Spark's `conf` directory, by copying `conf/spark-env.sh.template`, and add the following lines to it: - * `export MESOS_NATIVE_LIBRARY=path to libmesos.so`. This path is usually `prefix/lib/libmesos.so` (where the prefix is `/usr/local` by default, see above). Also, on Mac OS X, the library is called `libmesos.dylib` instead of `libmesos.so`. - * `export SPARK_EXECUTOR_URI=path to spark-{{site.SPARK_VERSION}}.tar.gz uploaded above`. - * `export MASTER=mesos://HOST:PORT` where HOST:PORT is the host and port (default: 5050) of your Mesos master (or `zk://...` if using Mesos with ZooKeeper). -8. To run a Spark application against the cluster, when you create your `SparkContext`, pass the string `mesos://HOST:PORT` as the master URL. In
git commit: SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
Repository: spark Updated Branches: refs/heads/branch-1.0 fc6b65227 - 34f6fa921 SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies See https://issues.apache.org/jira/browse/SPARK-1828 for more information. This is being submitted to Jenkin's for testing. The dependency won't fully propagate in Maven central for a few more hours. Author: Patrick Wendell pwend...@gmail.com Closes #767 from pwendell/hive-shaded and squashes the following commits: ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies (cherry picked from commit d58cb33ffa9e98a64cecea7b40ce7bfbed145079) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34f6fa92 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34f6fa92 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34f6fa92 Branch: refs/heads/branch-1.0 Commit: 34f6fa92155ba70d7b17315664618a007f9325ab Parents: fc6b652 Author: Patrick Wendell pwend...@gmail.com Authored: Wed May 14 09:51:01 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 09:51:11 2014 -0700 -- project/SparkBuild.scala | 6 +++--- sql/hive/pom.xml | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34f6fa92/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index cca3fba..2d10d89 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -489,9 +489,9 @@ object SparkBuild extends Build { name := spark-hive, javaOptions += -XX:MaxPermSize=1g, libraryDependencies ++= Seq( - org.apache.hive % hive-metastore % hiveVersion, - org.apache.hive % hive-exec % hiveVersion, - org.apache.hive % hive-serde % hiveVersion + org.spark-project.hive % hive-metastore % hiveVersion, + org.spark-project.hive % hive-exec % hiveVersion, + org.spark-project.hive % hive-serde % hiveVersion ), // Multiple queries rely on the TestHive singleton. See comments there for more details. parallelExecution in Test := false, http://git-wip-us.apache.org/repos/asf/spark/blob/34f6fa92/sql/hive/pom.xml -- diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml index 4fd3cb0..0c55657 100644 --- a/sql/hive/pom.xml +++ b/sql/hive/pom.xml @@ -43,12 +43,12 @@ version${project.version}/version /dependency dependency - groupIdorg.apache.hive/groupId + groupIdorg.spark-project.hive/groupId artifactIdhive-metastore/artifactId version${hive.version}/version /dependency dependency - groupIdorg.apache.hive/groupId + groupIdorg.spark-project.hive/groupId artifactIdhive-exec/artifactId version${hive.version}/version exclusions @@ -63,7 +63,7 @@ artifactIdjackson-mapper-asl/artifactId /dependency dependency - groupIdorg.apache.hive/groupId + groupIdorg.spark-project.hive/groupId artifactIdhive-serde/artifactId version${hive.version}/version exclusions
git commit: [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler
Repository: spark Updated Branches: refs/heads/master d58cb33ff - 17f3075bc [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that. I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions. In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620). Author: Mark Hamstra markhams...@gmail.com Closes #622 from markhamstra/SPARK-1620 and squashes the following commits: 071d193 [Mark Hamstra] refactored post-SPARK-1772 1a6a35e [Mark Hamstra] another style fix d30eb94 [Mark Hamstra] scalastyle 3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit 8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17f3075b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17f3075b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17f3075b Branch: refs/heads/master Commit: 17f3075bc4aa8cbed165f7b367f70e84b1bc8db9 Parents: d58cb33 Author: Mark Hamstra markhams...@gmail.com Authored: Wed May 14 10:07:25 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 10:07:25 2014 -0700 -- .../apache/spark/deploy/client/AppClient.scala| 18 ++ .../org/apache/spark/deploy/worker/Worker.scala | 18 ++ .../spark/scheduler/TaskSchedulerImpl.scala | 3 ++- .../org/apache/spark/storage/BlockManager.scala | 2 +- .../main/scala/org/apache/spark/util/Utils.scala | 13 + 5 files changed, 36 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17f3075b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala index 896913d..d38e9e7 100644 --- a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala +++ b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala @@ -30,7 +30,7 @@ import org.apache.spark.{Logging, SparkConf, SparkException} import org.apache.spark.deploy.{ApplicationDescription, ExecutorState} import org.apache.spark.deploy.DeployMessages._ import org.apache.spark.deploy.master.Master -import org.apache.spark.util.AkkaUtils +import org.apache.spark.util.{Utils, AkkaUtils} /** * Interface allowing applications to speak with a Spark deploy cluster. Takes a master URL, @@ -88,13 +88,15 @@ private[spark] class AppClient( var retries = 0 registrationRetryTimer = Some { context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) { - retries += 1 - if (registered) { -registrationRetryTimer.foreach(_.cancel()) - } else if (retries = REGISTRATION_RETRIES) { -markDead(All masters are unresponsive! Giving up.) - } else { -tryRegisterAllMasters() + Utils.tryOrExit { +retries += 1 +if (registered) { + registrationRetryTimer.foreach(_.cancel()) +} else if (retries = REGISTRATION_RETRIES) { + markDead(All masters are unresponsive! Giving up.) +} else { + tryRegisterAllMasters() +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/17f3075b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala index 85d25dc..134624c 100755 --- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala @@ -166,14 +166,16 @@ private[spark] class Worker( var retries = 0 registrationRetryTimer = Some { context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) { -retries += 1 -if (registered) { - registrationRetryTimer.foreach(_.cancel()) -} else if (retries = REGISTRATION_RETRIES) { - logError(All masters are unresponsive! Giving up.) - System.exit(1) -} else { -
git commit: [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler
Repository: spark Updated Branches: refs/heads/branch-1.0 34f6fa921 - 9ff9078fc [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that. I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions. In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620). Author: Mark Hamstra markhams...@gmail.com Closes #622 from markhamstra/SPARK-1620 and squashes the following commits: 071d193 [Mark Hamstra] refactored post-SPARK-1772 1a6a35e [Mark Hamstra] another style fix d30eb94 [Mark Hamstra] scalastyle 3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit 8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler (cherry picked from commit 17f3075bc4aa8cbed165f7b367f70e84b1bc8db9) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9ff9078f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9ff9078f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9ff9078f Branch: refs/heads/branch-1.0 Commit: 9ff9078fc840c05c75f635b7a6acc5080b8e1185 Parents: 34f6fa9 Author: Mark Hamstra markhams...@gmail.com Authored: Wed May 14 10:07:25 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 10:07:39 2014 -0700 -- .../apache/spark/deploy/client/AppClient.scala| 18 ++ .../org/apache/spark/deploy/worker/Worker.scala | 18 ++ .../spark/scheduler/TaskSchedulerImpl.scala | 3 ++- .../org/apache/spark/storage/BlockManager.scala | 2 +- .../main/scala/org/apache/spark/util/Utils.scala | 13 + 5 files changed, 36 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9ff9078f/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala index 896913d..d38e9e7 100644 --- a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala +++ b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala @@ -30,7 +30,7 @@ import org.apache.spark.{Logging, SparkConf, SparkException} import org.apache.spark.deploy.{ApplicationDescription, ExecutorState} import org.apache.spark.deploy.DeployMessages._ import org.apache.spark.deploy.master.Master -import org.apache.spark.util.AkkaUtils +import org.apache.spark.util.{Utils, AkkaUtils} /** * Interface allowing applications to speak with a Spark deploy cluster. Takes a master URL, @@ -88,13 +88,15 @@ private[spark] class AppClient( var retries = 0 registrationRetryTimer = Some { context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) { - retries += 1 - if (registered) { -registrationRetryTimer.foreach(_.cancel()) - } else if (retries = REGISTRATION_RETRIES) { -markDead(All masters are unresponsive! Giving up.) - } else { -tryRegisterAllMasters() + Utils.tryOrExit { +retries += 1 +if (registered) { + registrationRetryTimer.foreach(_.cancel()) +} else if (retries = REGISTRATION_RETRIES) { + markDead(All masters are unresponsive! Giving up.) +} else { + tryRegisterAllMasters() +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/9ff9078f/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala index 85d25dc..134624c 100755 --- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala @@ -166,14 +166,16 @@ private[spark] class Worker( var retries = 0 registrationRetryTimer = Some { context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) { -retries += 1 -if (registered) { - registrationRetryTimer.foreach(_.cancel()) -} else if (retries = REGISTRATION_RETRIES)
git commit: [maven-release-plugin] prepare for next development iteration
Repository: spark Updated Branches: refs/heads/branch-1.0 54133abdc - e480bcfbd [maven-release-plugin] prepare for next development iteration Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e480bcfb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e480bcfb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e480bcfb Branch: refs/heads/branch-1.0 Commit: e480bcfbd269ae1d7a6a92cfb50466cf192fe1fb Parents: 54133ab Author: Patrick Wendell pwend...@gmail.com Authored: Wed May 14 17:50:40 2014 + Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 17:50:40 2014 + -- assembly/pom.xml | 2 +- bagel/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka/pom.xml| 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml | 2 +- external/zeromq/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- mllib/pom.xml | 2 +- pom.xml | 4 ++-- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- yarn/pom.xml | 2 +- yarn/stable/pom.xml | 2 +- 21 files changed, 22 insertions(+), 22 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 79b1b1f..f79766d 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 08932bb..85f6d99 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 3e22641..47c2507 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 006757a..b7cbb1a 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/external/flume/pom.xml -- diff --git a/external/flume/pom.xml b/external/flume/pom.xml index 3ba984e..b8fc07f 100644 --- a/external/flume/pom.xml +++ b/external/flume/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/external/kafka/pom.xml -- diff --git a/external/kafka/pom.xml b/external/kafka/pom.xml index cb4dd47..9eeb2e1 100644 --- a/external/kafka/pom.xml +++ b/external/kafka/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent/artifactId -version1.0.0/version +version1.0.1-SNAPSHOT/version relativePath../../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e480bcfb/external/mqtt/pom.xml -- diff --git a/external/mqtt/pom.xml b/external/mqtt/pom.xml index b10916d..f4272ce 100644 ---
Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.0.0-rc6 [created] aab03f5f9
git commit: Fix: sbt test throw an java.lang.OutOfMemoryError: PermGen space
Repository: spark Updated Branches: refs/heads/branch-1.0 e480bcfbd - 379f733e9 Fix: sbt test throw an java.lang.OutOfMemoryError: PermGen space Author: witgo wi...@qq.com Closes #773 from witgo/sbt_javaOptions and squashes the following commits: 26c7d38 [witgo] Improve sbt configuration (cherry picked from commit fde82c1549c78f1eebbb21ec34e60befbbff65f5) Signed-off-by: Reynold Xin r...@apache.org Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/379f733e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/379f733e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/379f733e Branch: refs/heads/branch-1.0 Commit: 379f733e988daf2f1cae4cdac2faf1c42998b2b5 Parents: e480bcf Author: witgo wi...@qq.com Authored: Wed May 14 11:19:26 2014 -0700 Committer: Reynold Xin r...@apache.org Committed: Wed May 14 11:19:43 2014 -0700 -- .rat-excludes| 5 + project/SparkBuild.scala | 1 + 2 files changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/379f733e/.rat-excludes -- diff --git a/.rat-excludes b/.rat-excludes index 5076695..6894678 100644 --- a/.rat-excludes +++ b/.rat-excludes @@ -43,3 +43,8 @@ test.out/* .*iml service.properties db.lck +build/* +dist/* +.*out +.*ipr +.*iws http://git-wip-us.apache.org/repos/asf/spark/blob/379f733e/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 2d10d89..9cec1be 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -183,6 +183,7 @@ object SparkBuild extends Build { javaOptions in Test += -Dspark.testing=1, javaOptions in Test += -Dsun.io.serialization.extendedDebugInfo=true, javaOptions in Test ++= System.getProperties.filter(_._1 startsWith spark).map { case (k,v) = s-D$k=$v }.toSeq, +javaOptions in Test ++= -Xmx3g -XX:PermSize=128M -XX:MaxNewSize=256m -XX:MaxPermSize=1g.split( ).toSeq, javaOptions += -Xmx3g, // Show full stack trace and duration in test cases. testOptions in Test += Tests.Argument(-oDF),
git commit: Fix: sbt test throw an java.lang.OutOfMemoryError: PermGen space
Repository: spark Updated Branches: refs/heads/master 17f3075bc - fde82c154 Fix: sbt test throw an java.lang.OutOfMemoryError: PermGen space Author: witgo wi...@qq.com Closes #773 from witgo/sbt_javaOptions and squashes the following commits: 26c7d38 [witgo] Improve sbt configuration Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fde82c15 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fde82c15 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fde82c15 Branch: refs/heads/master Commit: fde82c1549c78f1eebbb21ec34e60befbbff65f5 Parents: 17f3075 Author: witgo wi...@qq.com Authored: Wed May 14 11:19:26 2014 -0700 Committer: Reynold Xin r...@apache.org Committed: Wed May 14 11:19:26 2014 -0700 -- .rat-excludes| 5 + project/SparkBuild.scala | 1 + 2 files changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fde82c15/.rat-excludes -- diff --git a/.rat-excludes b/.rat-excludes index 5076695..6894678 100644 --- a/.rat-excludes +++ b/.rat-excludes @@ -43,3 +43,8 @@ test.out/* .*iml service.properties db.lck +build/* +dist/* +.*out +.*ipr +.*iws http://git-wip-us.apache.org/repos/asf/spark/blob/fde82c15/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 8d56b40..6adec55 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -183,6 +183,7 @@ object SparkBuild extends Build { javaOptions in Test += -Dspark.testing=1, javaOptions in Test += -Dsun.io.serialization.extendedDebugInfo=true, javaOptions in Test ++= System.getProperties.filter(_._1 startsWith spark).map { case (k,v) = s-D$k=$v }.toSeq, +javaOptions in Test ++= -Xmx3g -XX:PermSize=128M -XX:MaxNewSize=256m -XX:MaxPermSize=1g.split( ).toSeq, javaOptions += -Xmx3g, // Show full stack trace and duration in test cases. testOptions in Test += Tests.Argument(-oDF),
git commit: SPARK-1829 Sub-second durations shouldn't round to 0 s
Repository: spark Updated Branches: refs/heads/branch-1.0 379f733e9 - 530bdf7d4 SPARK-1829 Sub-second durations shouldn't round to 0 s As 99 ms up to 99 ms As 0.1 s from 0.1 s up to 0.9 s https://issues.apache.org/jira/browse/SPARK-1829 Compare the first image to the second here: http://imgur.com/RaLEsSZ,7VTlgfo#0 Author: Andrew Ash and...@andrewash.com Closes #768 from ash211/spark-1829 and squashes the following commits: 1c15b8e [Andrew Ash] SPARK-1829 Format sub-second durations more appropriately (cherry picked from commit a3315d7f4c7584dae2ee0aa33c6ec9e97b229b48) Signed-off-by: Reynold Xin r...@apache.org Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/530bdf7d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/530bdf7d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/530bdf7d Branch: refs/heads/branch-1.0 Commit: 530bdf7d4bde2e90e1523e65b089559a2eddd793 Parents: 379f733 Author: Andrew Ash and...@andrewash.com Authored: Wed May 14 12:01:14 2014 -0700 Committer: Reynold Xin r...@apache.org Committed: Wed May 14 12:01:22 2014 -0700 -- core/src/main/scala/org/apache/spark/ui/UIUtils.scala | 6 ++ 1 file changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/530bdf7d/core/src/main/scala/org/apache/spark/ui/UIUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala index a3d6a18..a43314f 100644 --- a/core/src/main/scala/org/apache/spark/ui/UIUtils.scala +++ b/core/src/main/scala/org/apache/spark/ui/UIUtils.scala @@ -36,7 +36,13 @@ private[spark] object UIUtils extends Logging { def formatDate(timestamp: Long): String = dateFormat.get.format(new Date(timestamp)) def formatDuration(milliseconds: Long): String = { +if (milliseconds 100) { + return %d ms.format(milliseconds) +} val seconds = milliseconds.toDouble / 1000 +if (seconds 1) { + return %.1f s.format(seconds) +} if (seconds 60) { return %.0f s.format(seconds) }
git commit: SPARK-1833 - Have an empty SparkContext constructor.
Repository: spark Updated Branches: refs/heads/branch-1.0 530bdf7d4 - 8e13ab2fe SPARK-1833 - Have an empty SparkContext constructor. This is nicer than relying on new SparkContext(new SparkConf()) Author: Patrick Wendell pwend...@gmail.com Closes #774 from pwendell/spark-context and squashes the following commits: ef9f12f [Patrick Wendell] SPARK-1833 - Have an empty SparkContext constructor. (cherry picked from commit 65533c7ec03e7eedf5cd9756822863ab6f034ec9) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e13ab2f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e13ab2f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e13ab2f Branch: refs/heads/branch-1.0 Commit: 8e13ab2fe25d2fd50ee84a42f0f2d248432c7734 Parents: 530bdf7 Author: Patrick Wendell pwend...@gmail.com Authored: Wed May 14 12:53:30 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 14 12:53:42 2014 -0700 -- core/src/main/scala/org/apache/spark/SparkContext.scala | 6 ++ 1 file changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8e13ab2f/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index 032b3d7..634c10c 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -67,6 +67,12 @@ class SparkContext(config: SparkConf) extends Logging { private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map() /** + * Create a SparkContext that loads settings from system properties (for instance, when + * launching with ./bin/spark-submit). + */ + def this() = this(new SparkConf()) + + /** * :: DeveloperApi :: * Alternative constructor for setting preferred locations where Spark will create executors. *
git commit: Bug fix of sparse vector conversion
Repository: spark Updated Branches: refs/heads/master 910a13b3c - 191279ce4 Bug fix of sparse vector conversion Fixed a small bug caused by the inconsistency of index/data array size and vector length. Author: Funes tianshao...@gmail.com Author: funes tianshao...@gmail.com Closes #661 from funes/bugfix and squashes the following commits: edb2b9d [funes] remove unused import 75dced3 [Funes] update test case d129a66 [Funes] Add test for sparse breeze by vector builder 64e7198 [Funes] Copy data only when necessary b85806c [Funes] Bug fix of sparse vector conversion Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/191279ce Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/191279ce Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/191279ce Branch: refs/heads/master Commit: 191279ce4edb940821d11a6b25cd33c8ad0af054 Parents: 910a13b Author: Funes tianshao...@gmail.com Authored: Thu May 8 17:54:10 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Thu May 8 17:54:10 2014 -0700 -- .../main/scala/org/apache/spark/mllib/linalg/Vectors.scala | 6 +- .../spark/mllib/linalg/BreezeVectorConversionSuite.scala| 9 + 2 files changed, 14 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/191279ce/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala index 7cdf6bd..84d2239 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala @@ -136,7 +136,11 @@ object Vectors { new DenseVector(v.toArray) // Can't use underlying array directly, so make a new one } case v: BSV[Double] = -new SparseVector(v.length, v.index, v.data) +if (v.index.length == v.used) { + new SparseVector(v.length, v.index, v.data) +} else { + new SparseVector(v.length, v.index.slice(0, v.used), v.data.slice(0, v.used)) +} case v: BV[_] = sys.error(Unsupported Breeze vector type: + v.getClass.getName) } http://git-wip-us.apache.org/repos/asf/spark/blob/191279ce/mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala index aacaa30..8abdac7 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeVectorConversionSuite.scala @@ -55,4 +55,13 @@ class BreezeVectorConversionSuite extends FunSuite { assert(vec.indices.eq(indices), should not copy data) assert(vec.values.eq(values), should not copy data) } + + test(sparse breeze with partially-used arrays to vector) { +val activeSize = 3 +val breeze = new BSV[Double](indices, values, activeSize, n) +val vec = Vectors.fromBreeze(breeze).asInstanceOf[SparseVector] +assert(vec.size === n) +assert(vec.indices === indices.slice(0, activeSize)) +assert(vec.values === values.slice(0, activeSize)) + } }
git commit: [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...
Repository: spark Updated Branches: refs/heads/branch-1.0 756c96939 - da9f9e05b [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... ... that do not change schema Author: Kan Zhang kzh...@apache.org Closes #448 from kanzhang/SPARK-1460 and squashes the following commits: 111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD 91dc787 [Kan Zhang] Taking into account newly added Ordering param 79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema (cherry picked from commit 967635a2425a769b932eea0984fe697d6721cab0) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/da9f9e05 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/da9f9e05 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/da9f9e05 Branch: refs/heads/branch-1.0 Commit: da9f9e05b47f5745c09377de15eccca131f07d51 Parents: 756c969 Author: Kan Zhang kzh...@apache.org Authored: Wed May 7 09:41:31 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 7 09:41:43 2014 -0700 -- .../main/scala/org/apache/spark/rdd/RDD.scala | 10 +- .../scala/org/apache/spark/graphx/EdgeRDD.scala | 10 +- .../org/apache/spark/graphx/VertexRDD.scala | 10 +- project/MimaBuild.scala | 2 + python/pyspark/sql.py | 29 .../scala/org/apache/spark/sql/SchemaRDD.scala | 67 - .../spark/sql/api/java/JavaSchemaRDD.scala | 140 +++ 7 files changed, 246 insertions(+), 22 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/da9f9e05/core/src/main/scala/org/apache/spark/rdd/RDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index 3b3524f..a1ca612 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -128,7 +128,7 @@ abstract class RDD[T: ClassTag]( @transient var name: String = null /** Assign a name to this RDD */ - def setName(_name: String): RDD[T] = { + def setName(_name: String): this.type = { name = _name this } @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag]( * it is computed. This can only be used to assign a new storage level if the RDD does not * have a storage level set yet.. */ - def persist(newLevel: StorageLevel): RDD[T] = { + def persist(newLevel: StorageLevel): this.type = { // TODO: Handle changes of StorageLevel if (storageLevel != StorageLevel.NONE newLevel != storageLevel) { throw new UnsupportedOperationException( @@ -152,10 +152,10 @@ abstract class RDD[T: ClassTag]( } /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ - def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY) + def persist(): this.type = persist(StorageLevel.MEMORY_ONLY) /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ - def cache(): RDD[T] = persist() + def cache(): this.type = persist() /** * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. @@ -163,7 +163,7 @@ abstract class RDD[T: ClassTag]( * @param blocking Whether to block until all blocks are deleted. * @return This RDD. */ - def unpersist(blocking: Boolean = true): RDD[T] = { + def unpersist(blocking: Boolean = true): this.type = { logInfo(Removing RDD + id + from persistence list) sc.unpersistRDD(id, blocking) storageLevel = StorageLevel.NONE http://git-wip-us.apache.org/repos/asf/spark/blob/da9f9e05/graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala b/graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala index 6d04bf7..fa78ca9 100644 --- a/graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala @@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag]( override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect() - override def persist(newLevel: StorageLevel): EdgeRDD[ED] = { + override def persist(newLevel: StorageLevel): this.type = { partitionsRDD.persist(newLevel) this } - /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ - override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY) - - /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ - override def cache(): EdgeRDD[ED] = persist() - - override def unpersist(blocking: Boolean =
git commit: [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
Repository: spark Updated Branches: refs/heads/branch-1.0 adf8cdd0b - 2a878dab6 [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar Author: witgo wi...@qq.com Closes #688 from witgo/SPARK-1644 and squashes the following commits: 56ad6ac [witgo] review commit 87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644 6ffa7e4 [witgo] review commit a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar (cherry picked from commit 561510867a1b79beef57acf9df65c9f88481435d) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a878dab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a878dab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a878dab Branch: refs/heads/branch-1.0 Commit: 2a878dab6a04e39c1ff50a322a61abbc9a08d5db Parents: adf8cdd Author: witgo wi...@qq.com Authored: Sat May 10 10:15:04 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Sat May 10 10:15:16 2014 -0700 -- assembly/pom.xml | 1 + project/SparkBuild.scala | 11 ++- 2 files changed, 7 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a878dab/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 4a5df66..208794f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -96,6 +96,7 @@ filter artifact*:*/artifact excludes +excludeorg.datanucleus:*/exclude excludeMETA-INF/*.SF/exclude excludeMETA-INF/*.DSA/exclude excludeMETA-INF/*.RSA/exclude http://git-wip-us.apache.org/repos/asf/spark/blob/2a878dab/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 79c1972..d0b5409 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -579,12 +579,13 @@ object SparkBuild extends Build { def extraAssemblySettings() = Seq( test in assembly := {}, mergeStrategy in assembly := { - case m if m.toLowerCase.endsWith(manifest.mf) = MergeStrategy.discard - case m if m.toLowerCase.matches(meta-inf.*\\.sf$) = MergeStrategy.discard - case log4j.properties = MergeStrategy.discard + case PathList(org, datanucleus, xs @ _*) = MergeStrategy.discard + case m if m.toLowerCase.endsWith(manifest.mf) = MergeStrategy.discard + case m if m.toLowerCase.matches(meta-inf.*\\.sf$) = MergeStrategy.discard + case log4j.properties = MergeStrategy.discard case m if m.toLowerCase.startsWith(meta-inf/services/) = MergeStrategy.filterDistinctLines - case reference.conf = MergeStrategy.concat - case _ = MergeStrategy.first + case reference.conf= MergeStrategy.concat + case _ = MergeStrategy.first } )
git commit: Converted bang to ask to avoid scary warning when a block is removed
Repository: spark Updated Branches: refs/heads/branch-1.0 1d56cd544 - b8c17e392 Converted bang to ask to avoid scary warning when a block is removed Removing a block through the blockmanager gave a scary warning messages in the driver. ``` 2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true 2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true 2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true ``` This is because the [BlockManagerSlaveActor](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerSlaveActor.scala#L44) would send back an acknowledgement (true). But the BlockManagerMasterActor would have sent the RemoveBlock message as a send, not as ask(), so would reject the receiver true as a unknown message. @pwendell Author: Tathagata Das tathagata.das1...@gmail.com Closes #708 from tdas/bm-fix and squashes the following commits: ed4ef15 [Tathagata Das] Converted bang to ask to avoid scary warning when a block is removed. (cherry picked from commit 32868f31f88aebd580ab9329dc51a30c26af7a74) Signed-off-by: Tathagata Das tathagata.das1...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8c17e39 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8c17e39 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8c17e39 Branch: refs/heads/branch-1.0 Commit: b8c17e3928d070d4757d44995516b8872196e5c9 Parents: 1d56cd5 Author: Tathagata Das tathagata.das1...@gmail.com Authored: Thu May 8 22:34:08 2014 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Thu May 8 22:34:21 2014 -0700 -- .../scala/org/apache/spark/storage/BlockManagerMasterActor.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b8c17e39/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala -- diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala index 98fa0df..6aed322 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala @@ -250,7 +250,7 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus // Remove the block from the slave's BlockManager. // Doesn't actually wait for a confirmation and the message might get lost. // If message loss becomes frequent, we should add retry logic here. - blockManager.get.slaveActor ! RemoveBlock(blockId) + blockManager.get.slaveActor.ask(RemoveBlock(blockId))(akkaTimeout) } } }
git commit: SPARK-1770: Revert accidental(?) fix
Repository: spark Updated Branches: refs/heads/branch-1.0 80f292a21 - 8202276c9 SPARK-1770: Revert accidental(?) fix Looks like this change was accidentally committed here: https://github.com/apache/spark/commit/06b15baab25951d124bbe6b64906f4139e037deb but the change does not show up in the PR itself (#704). Other than not intending to go in with that PR, this also broke the test JavaAPISuite.repartition. Author: Aaron Davidson aa...@databricks.com Closes #716 from aarondav/shufflerand and squashes the following commits: b1cf70b [Aaron Davidson] SPARK-1770: Revert accidental(?) fix (cherry picked from commit 59577df14c06417676a9ffdd599f5713c448e299) Signed-off-by: Aaron Davidson aa...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8202276c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8202276c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8202276c Branch: refs/heads/branch-1.0 Commit: 8202276c916879eeb64e2b5591aa0faf5b0172bd Parents: 80f292a Author: Aaron Davidson aa...@databricks.com Authored: Fri May 9 14:51:34 2014 -0700 Committer: Aaron Davidson aa...@databricks.com Committed: Fri May 9 14:52:13 2014 -0700 -- core/src/main/scala/org/apache/spark/rdd/RDD.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8202276c/core/src/main/scala/org/apache/spark/rdd/RDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala index 9d8d804..a1ca612 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala @@ -330,9 +330,9 @@ abstract class RDD[T: ClassTag]( if (shuffle) { // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( -new ShuffledRDD[Int, T, (Int, T)](map(x = (Utils.random.nextInt(), x)), +new ShuffledRDD[T, Null, (T, Null)](map(x = (x, null)), new HashPartitioner(numPartitions)), -numPartitions).values +numPartitions).keys } else { new CoalescedRDD(this, numPartitions) }
git commit: SPARK-1668: Add implicit preference as an option to examples/MovieLensALS
Repository: spark Updated Branches: refs/heads/branch-1.0 c7b27043a - 35aa2448a SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/ Author: Sandeep sand...@techaddict.me Closes #597 from techaddict/SPARK-1668 and squashes the following commits: 8b371dc [Sandeep] Second Pass on reviews by mengxr eca9d37 [Sandeep] based on mengxr's suggestions 937e54c [Sandeep] Changes 5149d40 [Sandeep] Changes based on review 1dd7657 [Sandeep] use mean() 42444d7 [Sandeep] Based on Suggestions by mengxr e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/ (cherry picked from commit 108c4c16cc82af2e161d569d2c23849bdbf4aadb) Signed-off-by: Reynold Xin r...@apache.org Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/35aa2448 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/35aa2448 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/35aa2448 Branch: refs/heads/branch-1.0 Commit: 35aa2448ab2f02e822aeef0fbfacf297f0ca39ec Parents: c7b2704 Author: Sandeep sand...@techaddict.me Authored: Thu May 8 00:15:05 2014 -0400 Committer: Reynold Xin r...@apache.org Committed: Thu May 8 00:15:15 2014 -0400 -- .../spark/examples/mllib/MovieLensALS.scala | 55 1 file changed, 46 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/35aa2448/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala index 703f022..0e4447e 100644 --- a/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala +++ b/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala @@ -43,7 +43,8 @@ object MovieLensALS { kryo: Boolean = false, numIterations: Int = 20, lambda: Double = 1.0, - rank: Int = 10) + rank: Int = 10, + implicitPrefs: Boolean = false) def main(args: Array[String]) { val defaultParams = Params() @@ -62,6 +63,9 @@ object MovieLensALS { opt[Unit](kryo) .text(suse Kryo serialization) .action((_, c) = c.copy(kryo = true)) + opt[Unit](implicitPrefs) +.text(use implicit preference) +.action((_, c) = c.copy(implicitPrefs = true)) arg[String](input) .required() .text(input paths to a MovieLens dataset of ratings) @@ -88,7 +92,25 @@ object MovieLensALS { val ratings = sc.textFile(params.input).map { line = val fields = line.split(::) - Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble) + if (params.implicitPrefs) { +/* + * MovieLens ratings are on a scale of 1-5: + * 5: Must see + * 4: Will enjoy + * 3: It's okay + * 2: Fairly bad + * 1: Awful + * So we should not recommend a movie if the predicted rating is less than 3. + * To map ratings to confidence scores, we use + * 5 - 2.5, 4 - 1.5, 3 - 0.5, 2 - -0.5, 1 - -1.5. This mappings means unobserved + * entries are generally between It's okay and Fairly bad. + * The semantics of 0 in this expanded world of non-positive weights + * are the same as never having interacted at all. + */ +Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble - 2.5) + } else { +Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble) + } }.cache() val numRatings = ratings.count() @@ -99,7 +121,18 @@ object MovieLensALS { val splits = ratings.randomSplit(Array(0.8, 0.2)) val training = splits(0).cache() -val test = splits(1).cache() +val test = if (params.implicitPrefs) { + /* + * 0 means don't know and positive values mean confident that the prediction should be 1. + * Negative values means confident that the prediction should be 0. + * We have in this case used some kind of weighted RMSE. The weight is the absolute value of + * the confidence. The error is the difference between prediction and either 1 or 0, + * depending on whether r is positive or negative. + */ + splits(1).map(x = Rating(x.user, x.product, if (x.rating 0) 1.0 else 0.0)) +} else { + splits(1) +}.cache() val numTraining = training.count() val numTest = test.count() @@ -111,9 +144,10 @@ object MovieLensALS { .setRank(params.rank)
[1/2] SPARK-1565, update examples to be used with spark-submit script.
Repository: spark Updated Branches: refs/heads/master 19c8fb02b - 44dd57fb6 http://git-wip-us.apache.org/repos/asf/spark/blob/44dd57fb/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala b/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala index 47bf1e5..3a10daa 100644 --- a/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala +++ b/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala @@ -24,6 +24,7 @@ import org.apache.spark.storage.StorageLevel import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.mqtt._ +import org.apache.spark.SparkConf /** * A simple Mqtt publisher for demonstration purposes, repeatedly publishes @@ -64,7 +65,6 @@ object MQTTPublisher { } } -// scalastyle:off /** * A sample wordcount with MqttStream stream * @@ -74,30 +74,28 @@ object MQTTPublisher { * Eclipse paho project provides Java library for Mqtt Client http://www.eclipse.org/paho/ * Example Java code for Mqtt Publisher and Subscriber can be found here * https://bitbucket.org/mkjinesh/mqttclient - * Usage: MQTTWordCount master MqttbrokerUrl topic - * In local mode, master should be 'local[n]' with n 1 - * MqttbrokerUrl and topic describe where Mqtt publisher is running. + * Usage: MQTTWordCount MqttbrokerUrl topic +\ * MqttbrokerUrl and topic describe where Mqtt publisher is running. * * To run this example locally, you may run publisher as - *`$ ./bin/run-example org.apache.spark.examples.streaming.MQTTPublisher tcp://localhost:1883 foo` + *`$ ./bin/spark-submit examples.jar \ + *--class org.apache.spark.examples.streaming.MQTTPublisher tcp://localhost:1883 foo` * and run the example as - *`$ ./bin/run-example org.apache.spark.examples.streaming.MQTTWordCount local[2] tcp://localhost:1883 foo` + *`$ ./bin/spark-submit examples.jar \ + *--class org.apache.spark.examples.streaming.MQTTWordCount tcp://localhost:1883 foo` */ -// scalastyle:on object MQTTWordCount { def main(args: Array[String]) { -if (args.length 3) { +if (args.length 2) { System.err.println( -Usage: MQTTWordCount master MqttbrokerUrl topic + - In local mode, master should be 'local[n]' with n 1) +Usage: MQTTWordCount MqttbrokerUrl topic) System.exit(1) } -val Seq(master, brokerUrl, topic) = args.toSeq - -val ssc = new StreamingContext(master, MqttWordCount, Seconds(2), System.getenv(SPARK_HOME), -StreamingContext.jarOfClass(this.getClass).toSeq) +val Seq(brokerUrl, topic) = args.toSeq +val sparkConf = new SparkConf().setAppName(MQTTWordCount) +val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2) val words = lines.flatMap(x = x.toString.split( )) http://git-wip-us.apache.org/repos/asf/spark/blob/44dd57fb/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala -- diff --git a/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala b/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala index acfe9a4..ad7a199 100644 --- a/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala +++ b/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala @@ -17,41 +17,38 @@ package org.apache.spark.examples.streaming +import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.storage.StorageLevel -// scalastyle:off /** * Counts words in text encoded with UTF8 received from the network every second. * - * Usage: NetworkWordCount master hostname port - * master is the Spark master URL. In local mode, master should be 'local[n]' with n 1. - * hostname and port describe the TCP server that Spark Streaming would connect to receive data. + * Usage: NetworkWordCount hostname port + * hostname and port describe the TCP server that Spark Streaming would connect to receive data. * * To run this on your local machine, you need to first run a Netcat server *`$ nc -lk ` * and then run the example - *`$ ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount local[2] localhost ` + *`$ ./bin/spark-submit examples.jar \ + *--class org.apache.spark.examples.streaming.NetworkWordCount localhost ` */ -// scalastyle:on object NetworkWordCount { def main(args: Array[String])
svn commit: r1592937 - in /spark: documentation.md site/documentation.html
Author: andrew Date: Wed May 7 03:45:54 2014 New Revision: 1592937 URL: http://svn.apache.org/r1592937 Log: adding more videos from past meetups to documentation page Modified: spark/documentation.md spark/site/documentation.html Modified: spark/documentation.md URL: http://svn.apache.org/viewvc/spark/documentation.md?rev=1592937r1=1592936r2=1592937view=diff == --- spark/documentation.md (original) +++ spark/documentation.md Wed May 7 03:45:54 2014 @@ -49,18 +49,37 @@ See the a href=http://www.youtube.com/ /ul h4Meetup Talk Videos/h4 +In addition to the videos listed below, you can also view a href=http://www.meetup.com/spark-users/files/;all slides from Bay Area meetups here/a. +style type=text/css + .video-meta-info { +font-size: 0.95em; + } +/style ul - lia href=http://www.youtube.com/watch?v=NUQ-8to2XAklist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Spark 1.0 and Beyond - Patrick Wendell/a (a href=http://files.meetup.com/3138542/Spark%201.0%20Meetup.ppt;slides/a) by Patrick Wendell, at Cisco in San Jose, 2014-04-23/li + lia href=http://www.youtube.com/watch?v=NUQ-8to2XAklist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Spark 1.0 and Beyond/a (a href=http://files.meetup.com/3138542/Spark%201.0%20Meetup.ppt;slides/a) span class=video-meta-infoby Patrick Wendell, at Cisco in San Jose, 2014-04-23/span/li - lia href=http://www.youtube.com/watch?v=ju2OQEXqONUlist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Adding Native SQL Support to Spark with Catalyst/a (a href=http://files.meetup.com/3138542/Spark%20SQL%20Meetup%20-%204-8-2012.pdf;slides/a) by Michael Armbrust, at Tagged in San Francisco, 2014-04-08/li + lia href=http://www.youtube.com/watch?v=ju2OQEXqONUlist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Adding Native SQL Support to Spark with Catalyst/a (a href=http://files.meetup.com/3138542/Spark%20SQL%20Meetup%20-%204-8-2012.pdf;slides/a) span class=video-meta-infoby Michael Armbrust, at Tagged in SF, 2014-04-08/span/li - lia href=http://www.youtube.com/watch?v=MY0NkZY_tJwlist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;SparkR and GraphX/a (a href=http://files.meetup.com/3138542/SparkR-meetup.pdf;SparkR slides/a, a href=http://files.meetup.com/3138542/graphx%40spark_meetup03_2014.pdf;GraphX slides/a) by Shivaram Venkataraman amp; Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25/li + lia href=http://www.youtube.com/watch?v=MY0NkZY_tJwlist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;SparkR and GraphX/a (slides: a href=http://files.meetup.com/3138542/SparkR-meetup.pdf;SparkR/a, a href=http://files.meetup.com/3138542/graphx%40spark_meetup03_2014.pdf;GraphX/a) span class=video-meta-infoby Shivaram Venkataraman amp; Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25/span/li + + lia href=http://www.youtube.com/watch?v=5niXiiEX5pElist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Simple deployment w/ SIMR amp; Advanced Shark Analytics w/ TGFs/a (a href=http://files.meetup.com/3138542/tgf.pptx;slides/a) span class=video-meta-infoby Ali Ghodsi, at Huawei in Santa Clara, 2014-02-05/span/li + + lia href=http://www.youtube.com/watch?v=C7gWtxelYNMlist=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a;Stores, Monoids amp; Dependency Injection - Abstractions for Spark/a (a href=http://files.meetup.com/3138542/Abstractions%20for%20spark%20streaming%20-%20spark%20meetup%20presentation.pdf;slides/a) span class=video-meta-infoby Ryan Weald, at Sharethrough in SF, 2014-01-17/span/li + + lia href=https://www.youtube.com/watch?v=IxDnF_X4M-8;Distributed Machine Learning using MLbase/a (a href=http://files.meetup.com/3138542/sparkmeetup_8_6_13_final_reduced.pdf;slides/a) span class=video-meta-infoby Evan Sparks amp; Ameet Talwalkar, at Twitter in SF, 2013-08-06/span/li + + lia href=https://www.youtube.com/watch?v=vJQ2RZj9hqs;GraphX Preview: Graph Analysis on Spark/a span class=video-meta-infoby Reynold Xin amp; Joseph Gonzalez, at Flurry in SF, 2013-07-02/span/li + + lia href=http://www.youtube.com/watch?v=D1knCQZQQnw;Deep Dive with Spark Streaming/a (a href=http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617;slides/a) span class=video-meta-infoby Tathagata Das, at Plug and Play in Sunnyvale, 2013-06-17/span/li + + lia href=https://www.youtube.com/watch?v=cAZ624-69PQ;Tachyon and Shark update/a (slides: a href=http://files.meetup.com/3138542/2013-05-09%20Shark%20%40%20Spark%20Meetup.pdf;Shark/a, a href=http://files.meetup.com/3138542/Tachyon_2013-05-09_Spark_Meetup.pdf;Tachyon/a) span class=video-meta-infoby Ali Ghodsi, Haoyuan Li, Reynold Xin, Google Ventures, 2013-05-09/span/li + + lia href=https://www.youtube.com/playlist?list=PLxwbieuTaYXmWTBovyyw2NibPfUaJk-h4;Spark 0.7: Overview, pySpark, amp; Streaming/a span class=video-meta-infoby Matei Zaharia, Josh Rosen, Tathagata Das, at Conviva on 2013-02-21/span/li + + lia href=https://www.youtube.com/watch?v=49Hr5xZyTEA;Introduction
git commit: SPARK-1569 Spark on Yarn, authentication broken by pr299
Repository: spark Updated Branches: refs/heads/master 520087224 - 4bec84b6a SPARK-1569 Spark on Yarn, authentication broken by pr299 Pass the configs as java options since the executor needs to know before it registers whether to create the connection using authentication or not.We could see about passing only the authentication configs but for now I just had it pass them all. I also updating it to use a list to construct the command to make it the same as ClientBase and avoid any issues with spaces. Author: Thomas Graves tgra...@apache.org Closes #649 from tgravescs/SPARK-1569 and squashes the following commits: 0178ab8 [Thomas Graves] add akka settings 22a8735 [Thomas Graves] Change to only path spark.auth* configs 8ccc1d4 [Thomas Graves] SPARK-1569 Spark on Yarn, authentication broken Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4bec84b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4bec84b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4bec84b6 Branch: refs/heads/master Commit: 4bec84b6a23e1e642708a70a6c7ef7b3d1df9b3e Parents: 5200872 Author: Thomas Graves tgra...@apache.org Authored: Wed May 7 15:51:53 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 7 15:51:53 2014 -0700 -- .../deploy/yarn/ExecutorRunnableUtil.scala | 49 1 file changed, 30 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4bec84b6/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala -- diff --git a/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala b/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala index 96f8aa9..32f8861 100644 --- a/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala +++ b/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala @@ -21,7 +21,7 @@ import java.io.File import java.net.URI import scala.collection.JavaConversions._ -import scala.collection.mutable.HashMap +import scala.collection.mutable.{HashMap, ListBuffer} import org.apache.hadoop.fs.Path import org.apache.hadoop.yarn.api._ @@ -44,9 +44,9 @@ trait ExecutorRunnableUtil extends Logging { hostname: String, executorMemory: Int, executorCores: Int, - localResources: HashMap[String, LocalResource]) = { + localResources: HashMap[String, LocalResource]): List[String] = { // Extra options for the JVM -var JAVA_OPTS = +val JAVA_OPTS = ListBuffer[String]() // Set the JVM memory val executorMemoryString = executorMemory + m JAVA_OPTS += -Xms + executorMemoryString + -Xmx + executorMemoryString + @@ -56,10 +56,21 @@ trait ExecutorRunnableUtil extends Logging { JAVA_OPTS += opts } -JAVA_OPTS += -Djava.io.tmpdir= + - new Path(Environment.PWD.$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR) + +JAVA_OPTS += -Djava.io.tmpdir= + + new Path(Environment.PWD.$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR) JAVA_OPTS += ClientBase.getLog4jConfiguration(localResources) +// Certain configs need to be passed here because they are needed before the Executor +// registers with the Scheduler and transfers the spark configs. Since the Executor backend +// uses Akka to connect to the scheduler, the akka settings are needed as well as the +// authentication settings. +sparkConf.getAll. + filter { case (k, v) = k.startsWith(spark.auth) || k.startsWith(spark.akka) }. + foreach { case (k, v) = JAVA_OPTS += -D + k + = + \\\ + v + \\\ } + +sparkConf.getAkkaConf. + foreach { case (k, v) = JAVA_OPTS += -D + k + = + \\\ + v + \\\ } + // Commenting it out for now - so that people can refer to the properties if required. Remove // it once cpuset version is pushed out. // The context is, default gc for server class machines end up using all cores to do gc - hence @@ -85,25 +96,25 @@ trait ExecutorRunnableUtil extends Logging { } */ -val commands = List[String]( - Environment.JAVA_HOME.$() + /bin/java + - -server + +val commands = Seq(Environment.JAVA_HOME.$() + /bin/java, + -server, // Kill if OOM is raised - leverage yarn's failure handling to cause rescheduling. // Not killing the task leaves various aspects of the executor and (to some extent) the jvm in // an inconsistent state. // TODO: If the OOM is not recoverable by rescheduling it on different node, then do // 'something' to fail job ... akin to blacklisting trackers in mapred ? - -XX:OnOutOfMemoryError='kill %p'
git commit: [SPARK-1754] [SQL] Add missing arithmetic DSL operations.
Repository: spark Updated Branches: refs/heads/master 5c5e7d580 - 322b1808d [SPARK-1754] [SQL] Add missing arithmetic DSL operations. Add missing arithmetic DSL operations: `unary_-`, `%`. Author: Takuya UESHIN ues...@happy-camper.st Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits: a09ef69 [Takuya UESHIN] Add also missing ! (not) operation. f73ae2c [Takuya UESHIN] Remove redundant tests. 5b3f087 [Takuya UESHIN] Add tests relating DSL operations. e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/322b1808 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/322b1808 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/322b1808 Branch: refs/heads/master Commit: 322b1808d21143dc323493203929488d69e8878a Parents: 5c5e7d5 Author: Takuya UESHIN ues...@happy-camper.st Authored: Thu May 8 15:31:47 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Thu May 8 15:31:47 2014 -0700 -- .../org/apache/spark/sql/catalyst/dsl/package.scala | 4 .../expressions/ExpressionEvaluationSuite.scala | 16 +++- 2 files changed, 19 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/322b1808/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala index dc83485..78d3a1d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala @@ -57,10 +57,14 @@ package object dsl { trait ImplicitOperators { def expr: Expression +def unary_- = UnaryMinus(expr) +def unary_! = Not(expr) + def + (other: Expression) = Add(expr, other) def - (other: Expression) = Subtract(expr, other) def * (other: Expression) = Multiply(expr, other) def / (other: Expression) = Divide(expr, other) +def % (other: Expression) = Remainder(expr, other) def (other: Expression) = And(expr, other) def || (other: Expression) = Or(expr, other) http://git-wip-us.apache.org/repos/asf/spark/blob/322b1808/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala index 91605d0..344d8a3 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala @@ -61,7 +61,7 @@ class ExpressionEvaluationSuite extends FunSuite { test(3VL Not) { notTrueTable.foreach { case (v, answer) = -val expr = Not(Literal(v, BooleanType)) +val expr = ! Literal(v, BooleanType) val result = expr.eval(null) if (result != answer) fail(s$expr should not evaluate to $result, expected: $answer)} @@ -381,6 +381,13 @@ class ExpressionEvaluationSuite extends FunSuite { checkEvaluation(Add(c1, Literal(null, IntegerType)), null, row) checkEvaluation(Add(Literal(null, IntegerType), c2), null, row) checkEvaluation(Add(Literal(null, IntegerType), Literal(null, IntegerType)), null, row) + +checkEvaluation(-c1, -1, row) +checkEvaluation(c1 + c2, 3, row) +checkEvaluation(c1 - c2, -1, row) +checkEvaluation(c1 * c2, 2, row) +checkEvaluation(c1 / c2, 0, row) +checkEvaluation(c1 % c2, 1, row) } test(BinaryComparison) { @@ -395,6 +402,13 @@ class ExpressionEvaluationSuite extends FunSuite { checkEvaluation(LessThan(c1, Literal(null, IntegerType)), null, row) checkEvaluation(LessThan(Literal(null, IntegerType), c2), null, row) checkEvaluation(LessThan(Literal(null, IntegerType), Literal(null, IntegerType)), null, row) + +checkEvaluation(c1 c2, true, row) +checkEvaluation(c1 = c2, true, row) +checkEvaluation(c1 c2, false, row) +checkEvaluation(c1 = c2, false, row) +checkEvaluation(c1 === c2, false, row) +checkEvaluation(c1 !== c2, true, row) } }
git commit: Use numpy directly for matrix multiply.
Repository: spark Updated Branches: refs/heads/branch-1.0 35aa2448a - 010040fd0 Use numpy directly for matrix multiply. Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size. For example - the following takes 19s locally after this change vs. 5m21s before the change. (16x speedup). bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10 Author: Evan Sparks evan.spa...@gmail.com Closes #687 from etrain/patch-1 and squashes the following commits: e094dbc [Evan Sparks] Touching only diaganols on update. d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply. (cherry picked from commit 6ed7e2cd01955adfbb3960e2986b6d19eaee8717) Signed-off-by: Reynold Xin r...@apache.org Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/010040fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/010040fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/010040fd Branch: refs/heads/branch-1.0 Commit: 010040fd0ddd38717e8747c884bc8b1cbf684d38 Parents: 35aa244 Author: Evan Sparks evan.spa...@gmail.com Authored: Thu May 8 00:24:36 2014 -0400 Committer: Reynold Xin r...@apache.org Committed: Thu May 8 00:24:44 2014 -0400 -- examples/src/main/python/als.py | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/010040fd/examples/src/main/python/als.py -- diff --git a/examples/src/main/python/als.py b/examples/src/main/python/als.py index a77dfb2..33700ab 100755 --- a/examples/src/main/python/als.py +++ b/examples/src/main/python/als.py @@ -36,14 +36,13 @@ def rmse(R, ms, us): def update(i, vec, mat, ratings): uu = mat.shape[0] ff = mat.shape[1] -XtX = matrix(np.zeros((ff, ff))) -Xty = np.zeros((ff, 1)) - -for j in range(uu): -v = mat[j, :] -XtX += v.T * v -Xty += v.T * ratings[i, j] -XtX += np.eye(ff, ff) * LAMBDA * uu + +XtX = mat.T * mat +XtY = mat.T * ratings[i, :].T + +for j in range(ff): +XtX[j,j] += LAMBDA * uu + return np.linalg.solve(XtX, Xty) if __name__ == __main__:
git commit: Nicer logging for SecurityManager startup
Repository: spark Updated Branches: refs/heads/master ca4318686 - 7f6f4a103 Nicer logging for SecurityManager startup Happy to open a jira ticket if you'd like to track one there. Author: Andrew Ash and...@andrewash.com Closes #678 from ash211/SecurityManagerLogging and squashes the following commits: 2aa0b7a [Andrew Ash] Nicer logging for SecurityManager startup Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f6f4a10 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f6f4a10 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f6f4a10 Branch: refs/heads/master Commit: 7f6f4a1035ae0c9fa2029fe991f621ca263d53e0 Parents: ca43186 Author: Andrew Ash and...@andrewash.com Authored: Wed May 7 17:24:12 2014 -0400 Committer: Reynold Xin r...@apache.org Committed: Wed May 7 17:24:12 2014 -0400 -- core/src/main/scala/org/apache/spark/SecurityManager.scala | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7f6f4a10/core/src/main/scala/org/apache/spark/SecurityManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala b/core/src/main/scala/org/apache/spark/SecurityManager.scala index b4b0067..74aa441 100644 --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala @@ -146,8 +146,9 @@ private[spark] class SecurityManager(sparkConf: SparkConf) extends Logging { setViewAcls(defaultAclUsers, sparkConf.get(spark.ui.view.acls, )) private val secretKey = generateSecretKey() - logInfo(SecurityManager, is authentication enabled: + authOn + - are ui acls enabled: + uiAclsOn + users with view permissions: + viewAcls.toString()) + logInfo(SecurityManager: authentication + (if (authOn) enabled else disabled) + +; ui acls + (if (uiAclsOn) enabled else disabled) + +; users with view permissions: + viewAcls.toString()) // Set our own authenticator to properly negotiate user/password for HTTP connections. // This is needed by the HTTP client fetching from the HttpServer. Put here so its
git commit: Update GradientDescentSuite.scala
Repository: spark Updated Branches: refs/heads/master 3188553f7 - 0c19bb161 Update GradientDescentSuite.scala use more faster way to construct an array Author: baishuo(ç½ç¡) vc_j...@hotmail.com Closes #588 from baishuo/master and squashes the following commits: 45b95fb [baishuo(ç½ç¡)] Update GradientDescentSuite.scala c03b61c [baishuo(ç½ç¡)] Update GradientDescentSuite.scala b666d27 [baishuo(ç½ç¡)] Update GradientDescentSuite.scala Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0c19bb16 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0c19bb16 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0c19bb16 Branch: refs/heads/master Commit: 0c19bb161b9b2b96c0c55d3ea09e81fd798cbec0 Parents: 3188553 Author: baishuo(ç½ç¡) vc_j...@hotmail.com Authored: Wed May 7 16:02:55 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 7 16:02:55 2014 -0700 -- .../apache/spark/mllib/optimization/GradientDescentSuite.scala | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0c19bb16/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala index c4b4334..8a16284 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala @@ -81,11 +81,11 @@ class GradientDescentSuite extends FunSuite with LocalSparkContext with ShouldMa // Add a extra variable consisting of all 1.0's for the intercept. val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42) val data = testData.map { case LabeledPoint(label, features) = - label - Vectors.dense(1.0, features.toArray: _*) + label - Vectors.dense(1.0 +: features.toArray) } val dataRDD = sc.parallelize(data, 2).cache() -val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*) +val initialWeightsWithIntercept = Vectors.dense(1.0 +: initialWeights.toArray) val (_, loss) = GradientDescent.runMiniBatchSGD( dataRDD, @@ -111,7 +111,7 @@ class GradientDescentSuite extends FunSuite with LocalSparkContext with ShouldMa // Add a extra variable consisting of all 1.0's for the intercept. val testData = GradientDescentSuite.generateGDInput(2.0, -1.5, 1, 42) val data = testData.map { case LabeledPoint(label, features) = - label - Vectors.dense(1.0, features.toArray: _*) + label - Vectors.dense(1.0 +: features.toArray) } val dataRDD = sc.parallelize(data, 2).cache()
git commit: [SPARK-1755] Respect SparkSubmit --name on YARN
Repository: spark Updated Branches: refs/heads/branch-1.0 ab912271a - 666bebe63 [SPARK-1755] Respect SparkSubmit --name on YARN Right now, SparkSubmit ignores the `--name` flag for both yarn-client and yarn-cluster. This is a bug. In client mode, SparkSubmit treats `--name` as a [cluster config](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170) and does not propagate this to SparkContext. In cluster mode, SparkSubmit passes this flag to `org.apache.spark.deploy.yarn.Client`, which only uses it for the [YARN ResourceManager](https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80), but does not propagate this to SparkContext. This PR ensures that `spark.app.name` is always set if SparkSubmit receives the `--name` flag, which is what the usage promises. This makes it possible for applications to start a SparkContext with an empty conf `val sc = new SparkContext(new SparkConf)`, and inherit the app name from SparkSubmit. Tested both modes on a YARN cluster. Author: Andrew Or andrewo...@gmail.com Closes #699 from andrewor14/yarn-app-name and squashes the following commits: 98f6a79 [Andrew Or] Fix tests dea932f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-app-name c86d9ca [Andrew Or] Respect SparkSubmit --name on YARN (cherry picked from commit 8b7841299439b7dc590b2f7e2339f24e8f3e19f6) Signed-off-by: Patrick Wendell pwend...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/666bebe6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/666bebe6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/666bebe6 Branch: refs/heads/branch-1.0 Commit: 666bebe63ef30be80dd1496e5f9164dd4cdb2016 Parents: ab91227 Author: Andrew Or andrewo...@gmail.com Authored: Thu May 8 20:45:29 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Thu May 8 20:45:37 2014 -0700 -- .../main/scala/org/apache/spark/deploy/SparkSubmit.scala | 9 + .../scala/org/apache/spark/deploy/SparkSubmitSuite.scala | 10 ++ 2 files changed, 11 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/666bebe6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index e39723f..16de6f7 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -160,6 +160,7 @@ object SparkSubmit { // each deploy mode; we iterate through these below val options = List[OptionAssigner]( OptionAssigner(args.master, ALL_CLUSTER_MGRS, false, sysProp = spark.master), + OptionAssigner(args.name, ALL_CLUSTER_MGRS, false, sysProp = spark.app.name), OptionAssigner(args.driverExtraClassPath, STANDALONE | YARN, true, sysProp = spark.driver.extraClassPath), OptionAssigner(args.driverExtraJavaOptions, STANDALONE | YARN, true, @@ -167,7 +168,7 @@ object SparkSubmit { OptionAssigner(args.driverExtraLibraryPath, STANDALONE | YARN, true, sysProp = spark.driver.extraLibraryPath), OptionAssigner(args.driverMemory, YARN, true, clOption = --driver-memory), - OptionAssigner(args.name, YARN, true, clOption = --name), + OptionAssigner(args.name, YARN, true, clOption = --name, sysProp = spark.app.name), OptionAssigner(args.queue, YARN, true, clOption = --queue), OptionAssigner(args.queue, YARN, false, sysProp = spark.yarn.queue), OptionAssigner(args.numExecutors, YARN, true, clOption = --num-executors), @@ -188,8 +189,7 @@ object SparkSubmit { OptionAssigner(args.jars, YARN, true, clOption = --addJars), OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, false, sysProp = spark.files), OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, true, sysProp = spark.files), - OptionAssigner(args.jars, LOCAL | STANDALONE | MESOS, false, sysProp = spark.jars), - OptionAssigner(args.name, LOCAL | STANDALONE | MESOS, false, sysProp = spark.app.name) + OptionAssigner(args.jars, LOCAL | STANDALONE | MESOS, false, sysProp = spark.jars) ) // For client mode make any added jars immediately visible on the classpath @@ -205,7 +205,8 @@ object SparkSubmit { (clusterManager opt.clusterManager) != 0) { if (opt.clOption != null) { childArgs += (opt.clOption, opt.value) -} else if (opt.sysProp != null) { +} +if (opt.sysProp != null) {
git commit: SPARK-1544 Add support for deep decision trees.
Repository: spark Updated Branches: refs/heads/master 0c19bb161 - f269b016a SPARK-1544 Add support for deep decision trees. @etrain and I came with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels. To summarize: 1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver). 2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth. cc: @atalwalkar, @hirakendu, @mengxr Author: Manish Amde manish...@gmail.com Author: manishamde manish...@gmail.com Author: Evan Sparks spa...@cs.berkeley.edu Closes #475 from manishamde/deep_tree and squashes the following commits: 968ca9d [Manish Amde] merged master 7fc9545 [Manish Amde] added docs ce004a1 [Manish Amde] minor formatting b27ad2c [Manish Amde] formatting 426bb28 [Manish Amde] programming guide blurb 8053fed [Manish Amde] more formatting 5eca9e4 [Manish Amde] grammar 4731cda [Manish Amde] formatting 5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation cbd9f14 [Manish Amde] modified scala.math to math dad9652 [Manish Amde] removed unused imports e0426ee [Manish Amde] renamed parameter 718506b [Manish Amde] added unit test 1517155 [Manish Amde] updated documentation 9dbdabe [Manish Amde] merge from master 719d009 [Manish Amde] updating user documentation fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree 0287772 [Evan Sparks] Fixing scalastyle issue. 2f1e093 [Manish Amde] minor: added doc for maxMemory parameter 2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree abc5a23 [Evan Sparks] Parameterizing max memory. 50b143a [Manish Amde] adding support for very deep trees Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f269b016 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f269b016 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f269b016 Branch: refs/heads/master Commit: f269b016acb17b24d106dc2b32a1be389489bb01 Parents: 0c19bb1 Author: Manish Amde manish...@gmail.com Authored: Wed May 7 17:08:38 2014 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Wed May 7 17:08:38 2014 -0700 -- docs/mllib-decision-tree.md | 15 ++- .../examples/mllib/DecisionTreeRunner.scala | 2 +- .../apache/spark/mllib/tree/DecisionTree.scala | 103 +-- .../mllib/tree/configuration/Strategy.scala | 6 +- .../spark/mllib/tree/DecisionTreeSuite.scala| 84 +-- 5 files changed, 177 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f269b016/docs/mllib-decision-tree.md -- diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md index 296277e..acf0fef 100644 --- a/docs/mllib-decision-tree.md +++ b/docs/mllib-decision-tree.md @@ -93,17 +93,14 @@ The recursive tree construction is stopped at a node when one of the two conditi 1. The node depth is equal to the `maxDepth` training parameter 2. No split candidate leads to an information gain at the node. +### Max memory requirements + +For faster processing, the decision tree algorithm performs simultaneous histogram computations for all nodes at each level of the tree. This could lead to high memory requirements at deeper levels of the tree leading to memory overflow errors. To alleviate this problem, a 'maxMemoryInMB' training parameter is provided which specifies the maximum amount of memory at the workers (twice as much at the master) to be allocated to the histogram computation. The default value is conservatively chosen to be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements for a level-wise computation crosses the `maxMemoryInMB` threshold, the node training tasks at each subsequent level is split into smaller tasks. + ### Practical limitations -1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)* - in memory for aggregating histograms over partitions. The current implementation might not scale - to very deep trees since the memory requirement grows exponentially with tree depth. -2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for - sparse input. -3. Python is not supported in this release. - -We are planning to solve these problems in the near future. Please drop us a line if you encounter -any issues. +1. The implemented algorithm reads both sparse and
git commit: Typo fix: fetchting - fetching
Repository: spark Updated Branches: refs/heads/branch-1.0 69e2726d4 - 0759ee790 Typo fix: fetchting - fetching Author: Andrew Ash and...@andrewash.com Closes #680 from ash211/patch-3 and squashes the following commits: 9ce3746 [Andrew Ash] Typo fix: fetchting - fetching (cherry picked from commit d00981a95185229fd1594d5c030a00f219fb1a14) Signed-off-by: Reynold Xin r...@apache.org Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0759ee79 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0759ee79 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0759ee79 Branch: refs/heads/branch-1.0 Commit: 0759ee790527f61bf9f4bcef4aa0befa1d430370 Parents: 69e2726 Author: Andrew Ash and...@andrewash.com Authored: Wed May 7 17:24:49 2014 -0400 Committer: Reynold Xin r...@apache.org Committed: Wed May 7 17:29:35 2014 -0400 -- make-distribution.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0759ee79/make-distribution.sh -- diff --git a/make-distribution.sh b/make-distribution.sh index ebcd8c7..759e555 100755 --- a/make-distribution.sh +++ b/make-distribution.sh @@ -189,7 +189,7 @@ if [ $SPARK_TACHYON == true ]; then TMPD=`mktemp -d 2/dev/null || mktemp -d -t 'disttmp'` pushd $TMPD /dev/null - echo Fetchting tachyon tgz + echo Fetching tachyon tgz wget $TACHYON_URL tar xf tachyon-${TACHYON_VERSION}-bin.tar.gz