[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

2017-03-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/17297#discussion_r106774285
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl 
private[scheduler](
   val stageTaskSets =
 taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new 
HashMap[Int, TaskSetManager])
   stageTaskSets(taskSet.stageAttemptId) = manager
-  val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
--- End diff --

Actually, that is not really the point of this check. It's just checking whether one 
stage has two task sets (a.k.a. stage attempts) where both are in the 
"non-zombie" state. It doesn't do any checks at all on which tasks are actually 
in those task sets.

This is just checking an invariant which we believe to always be true, but 
we figure it's better to fail fast if we hit this condition, rather than proceed 
with some inconsistent state. This check was added because behavior gets 
*really* confusing when the invariant is violated, and though we think it 
should always hold, we've still hit cases where it happens.
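
For reference, a minimal, self-contained sketch of the invariant being checked (a hedged illustration only; the names below are illustrative and not the actual `TaskSchedulerImpl` fields): a stage may have at most one non-zombie task set at a time, and we fail fast if a second one shows up.

```scala
// Hypothetical sketch of the "at most one non-zombie attempt per stage" invariant.
case class TaskSetSketch(stageAttemptId: Int, isZombie: Boolean)

def assertSingleActiveAttempt(
    stageId: Int,
    attempts: Map[Int, TaskSetSketch],  // keyed by stageAttemptId
    newAttemptId: Int): Unit = {
  val conflicting = attempts.exists { case (attemptId, ts) =>
    attemptId != newAttemptId && !ts.isZombie
  }
  if (conflicting) {
    // Fail fast instead of continuing to schedule with two live attempts of one stage.
    throw new IllegalStateException(s"more than one active taskSet for stage $stageId")
  }
}
```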


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16209
  
Sorry for cutting in, though. IMHO we need general logic to inject 
user-defined types via `UDTRegistration` into the DDL parser 
(`CatalystSqlParser`). If we had that logic, we could use those types in an 
expression string.
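
As a rough illustration of the idea (a hedged sketch only; `UserTypeRegistry` and `resolveTypeName` below are hypothetical names, not Spark's actual `UDTRegistration` or parser API): the DDL/type parser would consult a registry of user-defined types before falling back to the built-in ones.

```scala
import scala.collection.mutable

// Hypothetical registry mapping a user-facing type name to a UDT class name.
object UserTypeRegistry {
  private val udts = mutable.Map.empty[String, String]

  def register(typeName: String, udtClassName: String): Unit = {
    udts(typeName) = udtClassName
  }

  def lookup(typeName: String): Option[String] = udts.get(typeName)
}

// A parser hook could try the registry first when it sees an unknown type keyword
// in a DDL/expression string, e.g. "p MyPointUDT".
def resolveTypeName(name: String): String =
  UserTypeRegistry.lookup(name).getOrElse(s"builtin:$name")
```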





[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/17312
  
@rxin Thanks a lot. I added a number after `Aggregated Metrics by Executor`





[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/17312
  

![screenshot](https://cloud.githubusercontent.com/assets/4058918/24069386/0f556622-0be2-11e7-9f48-cc096cdd7d9b.png)






[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-17 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17297
  
I'm a bit confused by the description:

> 1. When a fetch failure happens, the task set manager ask the dag 
scheduler to abort all the non-running tasks. However, the running tasks in the 
task set are not killed.

This is already true. When there is a fetch failure, the [TaskSetManager 
is marked as 
zombie](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala?squery=TaskSetManager#L755),
 and the DAGScheduler resubmits stages, but nothing actively kills running 
tasks.

>  re-launches all tasks in the stage with the fetch failure that hadn't 
completed when the fetch failure occurred (the DAGScheduler re-lanches all of 
the tasks whose output data is not available -- which is equivalent to the set 
of tasks that hadn't yet completed).

I don't think it's true that it relaunches all tasks that hadn't completed 
_when the fetch failure occurred_. It relaunches all the tasks that haven't 
completed by the time the stage gets resubmitted. More tasks can complete 
between the time of the first failure and the time the stage is resubmitted.
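
A tiny sketch of that distinction (hypothetical names, not the DAGScheduler's actual bookkeeping): the set of partitions to re-run is whatever is still missing at resubmission time, not at failure time.

```scala
// Hypothetical sketch: re-run whatever has no completed attempt when the stage is
// resubmitted, so tasks that finished between the failure and the resubmission are skipped.
case class StageProgress(numPartitions: Int, completedAtResubmit: Set[Int])

def partitionsToRerun(p: StageProgress): Set[Int] =
  (0 until p.numPartitions).toSet -- p.completedAtResubmit

// e.g. 10 partitions, tasks 0 - 5 done by the time the stage is resubmitted:
assert(partitionsToRerun(StageProgress(10, Set(0, 1, 2, 3, 4, 5))) == Set(6, 7, 8, 9))
```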

But there are several other potential issues you may be trying to address.

Say there are stage 0 and stage 1, each with 10 tasks. Stage 0 completes 
fine on the first attempt, then stage 1 starts. Tasks 0 & 1 in stage 1 
complete, but then there is a fetch failure in task 2. Let's also say we have 
an abundance of cluster resources, so tasks 3 - 9 from stage 1, attempt 0 are 
still running.

Stage 0 gets resubmitted as attempt 1, just to regenerate the map output for 
whatever executor had the data for the fetch failure -- perhaps it's just one 
task from stage 0 that needs to be resubmitted. Now, lots of different scenarios 
are possible:

(a) Tasks 3 - 9 from stage 1 attempt 0 all finish successfully while stage 
0 attempt 1 is running. So when stage 0 attempt 1 finishes, stage 1 
attempt 1 is submitted with just task 2. If it completes successfully, we're 
done (no wasted work).

(b) Stage 0 attempt 1 finishes before tasks 3 - 9 from stage 1 attempt 0 
have finished. So stage 1 gets submitted again as stage 1 attempt 1, with 
tasks 2 - 9, and there are now two copies running for each of tasks 3 - 9. Maybe 
all the tasks from attempt 0 actually finish shortly after attempt 1 starts. In 
that case, the stage is complete as soon as there is one completed attempt for 
each task, but even after the stage completes successfully, all the other 
tasks keep running anyway. (Plenty of wasted work.)

(c) Like (b), but shortly after stage 1 attempt 1 is submitted, we get 
another fetch failure in one of the old "zombie" tasks from stage 1 attempt 0.  
But the [DAGScheduler realizes it already has a more recent attempt for this 
stage](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1268),
 so it ignores the fetch failure (see the sketch after these scenarios). All the 
other tasks keep running as usual. If there aren't any other issues, the stage 
completes when there is one completed attempt for each task. (Same amount of 
wasted work as (b).)

(d) While stage 0 attempt 1 is running, we get another fetch failure from 
stage 1 attempt 0, say in task 3, which reports a failure from a *different 
executor*. Maybe it's from a completely different host (just by chance, or 
there may be cluster maintenance where multiple hosts are serviced at once); or 
maybe it's from another executor on the same host (at least until we do 
something about your other PR on unregistering all shuffle files on a host).  
To be honest, I don't understand how things work in this scenario. We [mark 
stage 0 as 
failed](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1303),
 we [unregister some shuffle 
output](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1328),
 and [we resubmit stage 
0](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1319).  But stage 0 
attempt 1 is still running, so I would have expected us to end up with 
conflicting task sets. Whatever the real behavior is here, it seems we're at 
risk of even more duplicated work for yet another attempt of stage 1.

etc.
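
To make the behavior in (c) concrete, here is a minimal, self-contained sketch of the stale-attempt check (hedged, with illustrative names; this is not the DAGScheduler's actual code): a fetch failure reported by a task from an older, zombie stage attempt is simply ignored because a newer attempt is already tracked.

```scala
// Hypothetical sketch: only react to a fetch failure if it comes from the most
// recent attempt of the failed stage; otherwise it is stale and ignored.
case class FetchFailure(stageId: Int, stageAttemptId: Int)
case class TrackedStage(stageId: Int, latestAttemptId: Int)

def shouldHandle(failure: FetchFailure, stage: TrackedStage): Boolean =
  failure.stageId == stage.stageId && failure.stageAttemptId == stage.latestAttemptId

// A failure from attempt 0 arriving while attempt 1 is the latest is ignored:
assert(!shouldHandle(FetchFailure(1, 0), TrackedStage(1, 1)))
```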

So I think in (b) and (c), you are trying to avoid resubmitting tasks 3 - 9 
in stage 1 attempt 1. The thing is, there is a strong reason to believe that 
the original versions of those tasks will fail. Most likely, those tasks need 
map output from the same executor that caused the first fetch failure.  So Kay 
is 

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106774155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
--- End diff --

why the return type is `Seq[Seq[LogicalPlan]]`?





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106774124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774067
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -394,6 +394,68 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assertDataStructuresEmpty()
   }
 
+  test("All shuffle files should on the slave should be cleaned up when 
slave lost") {
+// reset the test context with the right shuffle service config
+afterEach()
+val conf = new SparkConf()
+conf.set("spark.shuffle.service.enabled", "true")
+init(conf)
+runEvent(ExecutorAdded("exec-hostA1", "hostA"))
+runEvent(ExecutorAdded("exec-hostA2", "hostA"))
+runEvent(ExecutorAdded("exec-hostB", "hostB"))
+val firstRDD = new MyRDD(sc, 3, Nil)
+val firstShuffleDep = new ShuffleDependency(firstRDD, new 
HashPartitioner(2))
+val firstShuffleId = firstShuffleDep.shuffleId
+val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
--- End diff --

You are right, it was confusing before. Changed accordingly.





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774060
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
+}
+  }
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnHost(
+  execId: String,
+  host: String,
+  maybeEpoch: Option[Long] = None) {
+val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+if (!failedEpoch.contains(execId) || failedEpoch(execId) < 
currentEpoch) {
+  failedEpoch(execId) = currentEpoch
+  logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+  blockManagerMaster.removeExecutor(execId)
+  for ((shuffleId, stage) <- shuffleIdToMapStage) {
+logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, 
currentEpoch))
+stage.removeOutputsOnHost(host)
+mapOutputTracker.registerMapOutputs(
+  shuffleId,
+  stage.outputLocInMapOutputTrackerFormat(),
+  changeEpoch = true)
+  }
+  if (shuffleIdToMapStage.isEmpty) {
+mapOutputTracker.incrementEpoch()
+  }
+  clearCacheLocs()
+} else {
+  logDebug("Additional executor lost message for " + execId +
+"(epoch " + currentEpoch + ")")
--- End diff --

Made changes as suggested, thanks!





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774040
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1365,18 +1369,25 @@ class DAGScheduler(
*/
   private[scheduler] def handleExecutorLost(
   execId: String,
-  filesLost: Boolean,
+  fileLost: Boolean) {
+removeExecutorAndUnregisterOutputOnExecutor(execId,
+  fileLost || !env.blockManager.externalShuffleServiceEnabled, None)
+  }
+
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnExecutor(
+  execId: String,
+  fileLost: Boolean,
   maybeEpoch: Option[Long] = None) {
 val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
 if (!failedEpoch.contains(execId) || failedEpoch(execId) < 
currentEpoch) {
   failedEpoch(execId) = currentEpoch
   logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
   blockManagerMaster.removeExecutor(execId)
-
-  if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
-logInfo("Shuffle files lost for executor: %s (epoch 
%d)".format(execId, currentEpoch))
+  if (fileLost) {
 // TODO: This will be really slow if we keep accumulating shuffle 
map stages
 for ((shuffleId, stage) <- shuffleIdToMapStage) {
+  logInfo("Shuffle files lost for executor: %s (epoch 
%d)".format(execId, currentEpoch))
--- End diff --

Ah, my bad, thanks for noticing. 





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774047
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
+}
+  }
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnHost(
+  execId: String,
+  host: String,
+  maybeEpoch: Option[Long] = None) {
--- End diff --

done





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774045
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
--- End diff --

done.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773943
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773716
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -83,9 +411,19 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-case j @ ExtractFiltersAndInnerJoins(input, conditions)
+case ExtractFiltersAndInnerJoins(input, conditions)
 if input.size > 2 && conditions.nonEmpty =>
-  createOrderedJoin(input, conditions)
+  if (conf.starSchemaDetection && !conf.cboEnabled) {
--- End diff --

I don't think this algorithm conflicts with CBO, though it conflicts with 
the other join reordering rule.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773686
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773624
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773584
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17312
  
Can you put a screenshot here? Might actually be useful to have.






[GitHub] spark issue #17318: [SPARK-19896][SQL] Throw an exception if case classes ha...

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17318
  
Can you put the "after" exception in the PR description as well?






[GitHub] spark pull request #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17337





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17337
  
Merging in master/branch-2.1.






[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773079
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = {
+plan match {
+  case Join(left, right, _: InnerLike, cond) =>
+val (leftPlans, leftConditions) = extractInnerJoins(left)
+val (rightPlans, rightConditions) = extractInnerJoins(right)
+(leftPlans ++ rightPlans, 
cond.toSet.flatMap(splitConjunctivePredicates) ++
+  leftConditions ++ rightConditions)
+  case Project(projectList, join) if 
projectList.forall(_.isInstanceOf[Attribute]) =>
+extractInnerJoins(join)
+  case _ =>
+(Seq(plan), Set())
+}
+  }
+
+  private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = 
plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
--- End diff --

Nit: `cond` -> `_`





[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
--- End diff --

Nit: `items` -> `subplans`?





[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773013
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
--- End diff --

How about

`Extracts the join conditions and sub-plans of consecutive inner joins`





[GitHub] spark issue #17333: [SPARK-19997] [SQL]fix proxy ugi could not get tgt to ca...

2017-03-17 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/17333
  
@jerryshao 😂





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772868
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r106772853
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -36,27 +36,24 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
 if (!conf.cboEnabled || !conf.joinReorderEnabled) {
   plan
 } else {
-  val result = plan transform {
-case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
-  reorder(p, p.outputSet)
-case j @ Join(_, _, _: InnerLike, _) =>
-  reorder(j, j.outputSet)
+  val result = plan transformDown {
+case j @ Join(_, _, _: InnerLike, _) => reorder(j)
   }
   // After reordering is finished, convert OrderedJoin back to Join
-  result transform {
+  result transformDown {
 case oj: OrderedJoin => oj.join
   }
 }
   }
 
-  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+  def reorder(plan: LogicalPlan): LogicalPlan = {
--- End diff --

private





[GitHub] spark issue #17338: [SPARK-19990][SQL]create a temp file for file in test.ja...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17338
  
**[Test build #74769 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74769/testReport)**
 for PR 17338 at commit 
[`09b5e40`](https://github.com/apache/spark/commit/09b5e4065d23fbc8ddc64bc1dc92557648d03818).





[GitHub] spark pull request #17338: [SPARK-19990][SQL]create a temp file for file in ...

2017-03-17 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17338

[SPARK-19990][SQL]create a temp file for file in test.jar's resource when 
run test accross different modules

## What changes were proposed in this pull request?

After merging `HiveDDLSuite` and `DDLSuite` in [SPARK-19235](https://issues.apache.org/jira/browse/SPARK-19235), we have two subclasses of `DDLSuite`, namely `HiveCatalogedDDLSuite` and `InMemoryCatalogDDLSuite`.

While `DDLSuite` is in the `sql/core` module and `HiveCatalogedDDLSuite` is in the `sql/hive` module, running `HiveCatalogedDDLSuite` also runs the tests defined in its parent class `DDLSuite`. Some of those test cases fail because they look up test files in the `sql/core` module's `resource` directory.

The test file path obtained there starts with 'jar:', e.g. "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to `new Path()` in datasource.scala.

This PR fixes this by copying the file from the resource into a temp dir.
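
A minimal sketch of the copying step (the helper name and details below are illustrative, not the exact code in this PR): read the resource from the classpath and write it to a temporary file, so the test can hand a plain local path to `new Path(...)` regardless of whether the resource lives in a jar.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical helper: copy a classpath resource (possibly inside a jar) to a temp file.
def resourceToTempFile(resource: String): String = {
  val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)
  require(in != null, s"resource not found: $resource")
  val tempFile = File.createTempFile("test-resource-", "-" + new File(resource).getName)
  tempFile.deleteOnExit()
  try {
    Files.copy(in, tempFile.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }
  tempFile.getAbsolutePath
}

// e.g. resourceToTempFile("test-data/cars.csv") returns a usable local file path
```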

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark fixtestfailemvn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17338.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17338


commit 64c6b223840307506468430ed5b50d24bf76218d
Author: windpiger 
Date:   2017-03-18T04:07:25Z

[SPARK-19990][SQL]create a temp file for file in test.jar's resource when 
run test accross different modules

commit 09b5e4065d23fbc8ddc64bc1dc92557648d03818
Author: windpiger 
Date:   2017-03-18T04:17:15Z

fix a comment







[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17112
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74765/
Test FAILed.





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17112
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17112
  
**[Test build #74765 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74765/testReport)**
 for PR 17112 at commit 
[`cfc7e33`](https://github.com/apache/spark/commit/cfc7e331356b2296a66273e8d059635ba1c7991b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17191
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74764/
Test PASSed.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17191
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17191
  
**[Test build #74764 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74764/testReport)**
 for PR 17191 at commit 
[`f982f30`](https://github.com/apache/spark/commit/f982f30b22c827e1f8058522b5f6d05f41735ff2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17219
  
**[Test build #74767 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74767/testReport)**
 for PR 17219 at commit 
[`2989fad`](https://github.com/apache/spark/commit/2989fadd54156f7483822eadb591736370301c16).





[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17088
  
**[Test build #74768 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)**
 for PR 17088 at commit 
[`9f64e29`](https://github.com/apache/spark/commit/9f64e2931eabd2fcc5909123e73c9c046caceb3b).





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772404
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772374
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17219
  
**[Test build #74766 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74766/testReport)**
 for PR 17219 at commit 
[`5d2ba62`](https://github.com/apache/spark/commit/5d2ba62f70442de0a4eeedc248d5e9ad3a611f2a).





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17192
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74763/
Test PASSed.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
--- End diff --

I think we can just write `input.forall(_.stats(conf).rowCount.isDefined)`, since both `Filter` and `Project` can propagate the row count



[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17192
  
**[Test build #74763 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74763/testReport)**
 for PR 17192 at commit 
[`185ea60`](https://github.com/apache/spark/commit/185ea6003d60feed20c56de61c17bc304663d99a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class JsonToStructs(`
  * `case class StructsToJson(`





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772029
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,340 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{BaseTableAccess, 
ExtractFiltersAndInnerJoins}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class DetectStarSchemaJoin(conf: CatalystConf) extends 
PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
--- End diff --

so this is a limitation of this algorithm, not a limitation of 
catalyst/spark sql, right?





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106771968
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SimpleCatalystConf.scala
 ---
@@ -40,6 +40,9 @@ case class SimpleCatalystConf(
 override val cboEnabled: Boolean = false,
 override val joinReorderEnabled: Boolean = false,
 override val joinReorderDPThreshold: Int = 12,
+override val starSchemaDetection: Boolean = false,
+override val starSchemaFTRatio: Double = 0.9,
+override val ndvMaxError: Double = 0.05,
--- End diff --

these are not needed anymore, `SQLConf` is in catalyst module now.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17337
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74762/
Test PASSed.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17337
  
**[Test build #74762 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74762/testReport)**
 for PR 17337 at commit 
[`4480e72`](https://github.com/apache/spark/commit/4480e72cb789212c2cc86cea50c61c20d3faa2d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17216
  
Seems like an unrelated failure. Probably a flaky test.







[GitHub] spark issue #17164: [SPARK-16844][SQL] Support codegen for sort-based aggrea...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17164
  
@hvanhovell ping





[GitHub] spark issue #17318: [SPARK-19896][SQL] Throw an exception if case classes ha...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17318
  
@cloud-fan ping





[GitHub] spark pull request #17191: [SPARK-14471][SQL] Aliases in SELECT could be use...

2017-03-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17191#discussion_r106770623
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2598,4 +2598,26 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
 assert(!jobStarted.get(), "Command should not trigger a Spark job.")
   }
+
+  test("SPARK-14471 When groupByAliasesEnabled=true, aliases in SELECT 
could exist in GROUP BY") {
+withSQLConf(SQLConf.GROUP_BY_ALIASES_ENABLED.key -> "true") {
--- End diff --

Updated





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17191
  
@gatorsmile I checked, and actually MySQL and PostgreSQL cannot use numbers as alias names:

```
// PostgreSQL
postgres=# SELECT gkey1 AS 1 FROM t2;
ERROR:  syntax error at or near "1"
LINE 1: SELECT gkey1 AS 1 FROM t2;

// MySQL
mysql> select gkey1 AS 1 from t2;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual 
that corresponds to your MySQL server version for the right syntax to use near 
'1 from t2' at line 1
mysql> 
```

In current master, Spark has the same behaviour; the parser throws an exception in this case:
```
scala> sql("SELECT key1 AS 1 FROM t").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '1' expecting {, ',', 'FROM', 'WHERE', 'GROUP', 
'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 15)

== SQL ==
SELECT key1 AS 1 FROM t
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:205)
```





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-17 Thread yanji84
Github user yanji84 commented on the issue:

https://github.com/apache/spark/pull/17109
  
hello, any update on this?





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17112
  
**[Test build #74765 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74765/testReport)**
 for PR 17112 at commit 
[`cfc7e33`](https://github.com/apache/spark/commit/cfc7e331356b2296a66273e8d059635ba1c7991b).





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17191
  
**[Test build #74764 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74764/testReport)**
 for PR 17191 at commit 
[`f982f30`](https://github.com/apache/spark/commit/f982f30b22c827e1f8058522b5f6d05f41735ff2).





[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16867
  
@squito 
Sure. I did a test with 100k tasks. The results are as below:

| operation | time cost |
| -- | -- |
| insert | 135ms, 122ms, 119ms, 120ms, 163ms |
| `checkSpeculatableTasks` | 6ms, 6ms, 6ms, 5ms, 6ms |
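
For reference, a rough sketch of how such a micro-benchmark can be set up (the data structure and durations below are illustrative, not the exact code used for these numbers):

```scala
import scala.collection.mutable

// Time a block and report milliseconds.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val rng = new scala.util.Random(42)
// Ordered set of (duration, taskId) pairs, standing in for the structure
// that tracks successful task durations.
val durations = mutable.TreeSet.empty[(Long, Int)]

time("insert") {
  (0 until 100000).foreach(i => durations += ((rng.nextInt(10000).toLong, i)))
}

time("checkSpeculatableTasks") {
  // e.g. find the median successful duration and count tasks slower than it.
  val median = durations.toIndexedSeq(durations.size / 2)._1
  durations.count(_._1 > median)
}
```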





[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

2017-03-17 Thread ioana-delaney
Github user ioana-delaney commented on the issue:

https://github.com/apache/spark/pull/17286
  
@wzhfy Some thoughts on how to solve the Cartesian problem as part of the join enumeration algorithm: apply a strategy similar to the one we discussed for star-plans. You keep track of "connected" tables and "unconnected" tables. During join enumeration, mixed combinations are pruned until the plan for the "connected" set of tables is built. Then, we add tables from the "unconnected" set - maybe only as left-deep trees (i.e. the size of the inner is one). Also, knowing that a set of tables is connected through join conditions allows further plan pruning based on the presence of join predicates. Integration with star-join, and probably other heuristics, would require introducing some filtering/pruning strategies on top of the search engine. Just some thoughts…
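
A rough sketch of the pruning idea described above, under the assumption that the enumerator considers candidate (left, right) plan pairs over sets of table names (the names and shape below are illustrative, not the PR's API):

```scala
// Returns true if a candidate (left, right) pair should be skipped:
// before the plan covering all "connected" tables is built, any pair that
// mixes connected and unconnected tables is pruned.
def shouldPrune(
    left: Set[String],
    right: Set[String],
    connected: Set[String],
    connectedPlanBuilt: Boolean): Boolean = {
  val tables = left ++ right
  val touchesConnected = tables.exists(connected.contains)
  val touchesUnconnected = tables.exists(t => !connected.contains(t))
  !connectedPlanBuilt && touchesConnected && touchesUnconnected
}

// Example: with connected = Set("f", "d1", "d2") and an unconnected table "u",
// the pair ({f, d1}, {u}) is pruned until the plan over {f, d1, d2} exists.
```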





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17191
  
aha, I'll check and just a sec.





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17192
  
(I just rebased to resolve the conflicts)





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17192
  
**[Test build #74763 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74763/testReport)**
 for PR 17192 at commit 
[`185ea60`](https://github.com/apache/spark/commit/185ea6003d60feed20c56de61c17bc304663d99a).





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17337
  
**[Test build #74762 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74762/testReport)**
 for PR 17337 at commit 
[`4480e72`](https://github.com/apache/spark/commit/4480e72cb789212c2cc86cea50c61c20d3faa2d1).





[GitHub] spark pull request #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread jaceklaskowski
GitHub user jaceklaskowski opened a pull request:

https://github.com/apache/spark/pull/17337

[SQL][MINOR] Fix scaladoc for UDFRegistration

## What changes were proposed in this pull request?

Fix scaladoc for UDFRegistration

## How was this patch tested?

local build

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaceklaskowski/spark udfregistration-scaladoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17337


commit 4480e72cb789212c2cc86cea50c61c20d3faa2d1
Author: Jacek Laskowski 
Date:   2017-03-18T01:15:50Z

[SQL][MINOR] Fix scaladoc for UDFRegistration







[GitHub] spark issue #17333: [SPARK-19997] [SQL]fix proxy ugi could not get tgt to ca...

2017-03-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17333
  
What a coincidence 😄!





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74761/
Test PASSed.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16626
  
**[Test build #74761 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74761/testReport)**
 for PR 16626 at commit 
[`b219178`](https://github.com/apache/spark/commit/b2191788f5d60261946da63e8c2634d5c6dfe6f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106768186
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class FailureSafeParser[IN](
+func: IN => Seq[InternalRow],
+mode: String,
+schema: StructType,
+columnNameOfCorruptRecord: String) {
+
+  private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
+  private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
+  private val resultRow = new GenericInternalRow(schema.length)
+
+  private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
+if (corruptFieldIndex.isDefined) {
+  (row, badRecord) => {
+for ((f, i) <- actualSchema.zipWithIndex) {
+  resultRow(schema.fieldIndex(f.name)) = row.map(_.get(i, 
f.dataType)).orNull
+}
+resultRow(corruptFieldIndex.get) = badRecord()
+resultRow
+  }
+} else {
+  (row, badRecord) => row.getOrElse {
+for (i <- schema.indices) resultRow.setNullAt(i)
--- End diff --

I ran some tests with the code below to help.

```scala
object ForWhile {
  def forloop = {
val l = Array[Int](1,2,3)
for (i <- l) {
}
  }

  def whileloop = {
val arr = Array[Int](1,2,3)
var idx = 0
while(idx < arr.length) {
  idx += 1
}
  }
}
```

```
Compiled from "ForWhile.scala"
public final class ForWhile {
  public static void whileloop();
Code:
   0: getstatic #16 // Field 
ForWhile$.MODULE$:LForWhile$;
   3: invokevirtual #18 // Method 
ForWhile$.whileloop:()V
   6: return

  public static void forloop();
Code:
   0: getstatic #16 // Field 
ForWhile$.MODULE$:LForWhile$;
   3: invokevirtual #21 // Method ForWhile$.forloop:()V
   6: return
}

Compiled from "ForWhile.scala"
public final class ForWhile$ {
  public static final ForWhile$ MODULE$;

  public static {};
Code:
   0: new   #2  // class ForWhile$
   3: invokespecial #12 // Method "":()V
   6: return

  public void forloop();
Code:
   0: getstatic #18 // Field 
scala/Array$.MODULE$:Lscala/Array$;
   3: getstatic #23 // Field 
scala/Predef$.MODULE$:Lscala/Predef$;
   6: iconst_3
   7: newarray   int
   9: dup
  10: iconst_0
  11: iconst_1
  12: iastore
  13: dup
  14: iconst_1
  15: iconst_2
  16: iastore
  17: dup
  18: iconst_2
  19: iconst_3
  20: iastore
  21: invokevirtual #27 // Method 
scala/Predef$.wrapIntArray:([I)Lscala/collection/mutable/WrappedArray;
  24: getstatic #32 // Field 
scala/reflect/ClassTag$.MODULE$:Lscala/reflect/ClassTag$;
  27: invokevirtual #36 // Method 
scala/reflect/ClassTag$.Int:()Lscala/reflect/ClassTag;
  30: invokevirtual #40 // Method 
scala/Array$.apply:(Lscala/collection/Seq;Lscala/reflect/ClassTag;)Ljava/lang/Object;
  33: checkcast #42 // class "[I"
  36: astore_1
  37: getstatic #23 // Field 
scala/Predef$.MODULE$:Lscala/Predef$;
  40: aload_1
  41: 

[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106767486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class FailureSafeParser[IN](
+func: IN => Seq[InternalRow],
+mode: String,
+schema: StructType,
+columnNameOfCorruptRecord: String) {
+
+  private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
+  private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
+  private val resultRow = new GenericInternalRow(schema.length)
+
+  private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
+if (corruptFieldIndex.isDefined) {
+  (row, badRecord) => {
+for ((f, i) <- actualSchema.zipWithIndex) {
+  resultRow(schema.fieldIndex(f.name)) = row.map(_.get(i, 
f.dataType)).orNull
+}
+resultRow(corruptFieldIndex.get) = badRecord()
+resultRow
+  }
+} else {
+  (row, badRecord) => row.getOrElse {
+for (i <- schema.indices) resultRow.setNullAt(i)
--- End diff --

I ran some tests. I can see boxing/unboxing that does not seem to exist in the while loop.

```scala
object ForWhile {
  def forloop = {
val l = Seq(1,2,3).indices
for (i <- l) println(i)
  }
}
```

```
2: invokestatic  #46 // Method 
scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
```

```
4: invokestatic  #37 // Method 
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
```
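
For comparison, here is a minimal, self-contained sketch of my own (not code from the PR) of the same iteration written as a while loop over a primitive index; compiled the same way, it emits no `BoxesRunTime.boxToInteger`/`unboxToInt` calls:

```scala
object ForWhile {
  // Hedged sketch for comparison only: iterate over the indices with a
  // primitive Int counter instead of a for-comprehension, so no Range
  // iterator is created and no boxing/unboxing is emitted.
  def whileloop = {
    val l = Seq(1, 2, 3).indices
    var i = 0
    while (i < l.length) {
      println(i)
      i += 1
    }
  }
}
```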







[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106767465
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
--- End diff --

@ron8hu Method ```StarSchemaDetection.findStarJoins()``` only finds the set of star-join plans. It doesn't reorder the plans.

The star-based reordering call from the ```ReorderJoin``` rule is disabled when CBO is turned on. If both are enabled, cost-based reordering will do the final reordering, but I blocked star plans just in case.




[GitHub] spark issue #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message for ver...

2017-03-17 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/17327
  
Thanks! Closed this.





[GitHub] spark pull request #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message ...

2017-03-17 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/17327





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106766281
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
--- End diff --

} else if (conf.joinReorderEnabled) {
  emptyStarJoinPlan
} else {

When both configuration parameters joinReorderEnabled and starSchemaDetection are true, we want to avoid performing join reordering twice. There is no added value in running the reordering a second time.
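
For illustration, here is a self-contained sketch of the control flow being proposed, with plain booleans standing in for the `starSchemaDetection` and `joinReorderEnabled` conf flags (an assumption-laden illustration of the idea, not Spark's actual code):

```scala
object StarJoinGuardSketch {
  // Hedged sketch: star-join detection is skipped when detection is off,
  // when there is nothing to reorder, or when cost-based join reordering is
  // already enabled, so the reordering work is never done twice.
  def shouldDetectStarJoins(
      starSchemaDetection: Boolean,
      joinReorderEnabled: Boolean,
      numInputPlans: Int): Boolean = {
    if (!starSchemaDetection || numInputPlans < 2) {
      false
    } else if (joinReorderEnabled) {
      false // CBO join reordering will take care of it
    } else {
      true
    }
  }
}
```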





[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-17 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106765877
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -296,6 +297,45 @@ class SessionCatalog(
   }
 
   /**
+   * Alter the schema of a table identified by the provided table 
identifier to add new columns
+   * @param identifier TableIdentifier
+   * @param columns new columns
+   * @param caseSensitive enforce case sensitivity for column names
+   */
+  def alterTableAddColumns(
--- End diff --

I will change back to use `alterTableSchema` to make it more generic for 
other schema evolution features. 
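
For illustration, a rough sketch of what a more generic entry point could look like; the object name, method body, and signature below are hypothetical and may differ from the PR's final API:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.types.StructType

object AlterTableSchemaSketch {
  // Hypothetical sketch only: a single schema-evolution entry point that
  // takes the full new schema rather than just the added columns.
  def alterTableSchema(identifier: TableIdentifier, newSchema: StructType): Unit = {
    // 1. look up the existing table metadata in the catalog
    // 2. validate that newSchema is a legal evolution of the current schema
    // 3. persist the updated schema through the external catalog
  }
}
```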





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14615
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14615
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74759/
Test PASSed.





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14615
  
**[Test build #74759 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74759/testReport)**
 for PR 14615 at commit 
[`0043190`](https://github.com/apache/spark/commit/004319080bd7194aba546aca540589c2eade76f1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17326: [SPARK-19985][ML] Fixed copy method for some ML M...

2017-03-17 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17326#discussion_r106765091
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -74,6 +74,7 @@ class MultilayerPerceptronClassifierSuite
   .setMaxIter(100)
   .setSolver("l-bfgs")
 val model = trainer.fit(dataset)
+MLTestingUtils.checkCopy(model)
--- End diff --

Just something for consideration. I am not sure this is the best way to add a unit test for this. I don't think we need to add the check only for the classes included in this PR. And even if we do, maybe it's possible to do it in a more uniform way, for example in `testEstimatorAndModelReadWrite`, or by adding a `pipelineCompatibilityTest` as necessary.
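
For context, a rough sketch of my own (possibly differing from what `MLTestingUtils.checkCopy` actually does) of what such a copy check typically verifies:

```scala
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap

object CopyCheckSketch {
  // Hedged sketch of a generic copy check: copying a fitted model with an
  // empty ParamMap should yield a model of the same type with the same parent.
  def checkCopy(model: Model[_]): Unit = {
    val copied = model.copy(ParamMap.empty).asInstanceOf[Model[_]]
    assert(copied.getClass == model.getClass)
    assert(copied.parent == model.parent)
  }
}
```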





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-03-17 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17014
  
Hi @zhengruifeng, is there any update?





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
What are the behaviors of MySQL and Postgres when we use digits as aliases?
```SQL
SELECT k1 AS `2`, k2 AS a, SUM(v) FROM t GROUP BY 2, k2
```

You might need to replace the backticks with other symbols for quoting the column names.





[GitHub] spark issue #17179: [SPARK-19067][SS] Processing-time-based timeout in MapGr...

2017-03-17 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17179
  
LGTM.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74760/
Test FAILed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74760 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74760/testReport)**
 for PR 17216 at commit 
[`a0c71af`](https://github.com/apache/spark/commit/a0c71af4ad2dd99815f85e3b0c842419beeea67c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17216





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17216
  
LGTM. Merging it to master.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74758/
Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74758 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74758/testReport)**
 for PR 17216 at commit 
[`3abe0a0`](https://github.com/apache/spark/commit/3abe0a0fcb60def5eb14697fa6f78eee318b77b0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74757/
Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74757 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74757/testReport)**
 for PR 17216 at commit 
[`a2b32ce`](https://github.com/apache/spark/commit/a2b32ce3ee536d1ea1d12ae4dcc3c561788e3dd2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

2017-03-17 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106761010
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -61,6 +63,36 @@ abstract class SubqueryExpression(
   }
 }
 
+/**
+ * This expression is used to represent any form of subquery expression 
namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient 
expression
+ * that only lives in the scope of sameResult function call. In other 
words, analyzer,
+ * optimizer or planner never sees this expression type during 
transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends LeafExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def toString: String = 
s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include 
the
+  // sub query plan. Doing so causes issue when we canonicalize 
expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+val state = Seq(children, this.getClass.getName)
+state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --

@rxin Sure Reynold.





[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/17088
  
Ok that makes sense. I wanted to make sure that there wasn't some bug in 
SlaveLost (which might lead to a simpler fix than this) but @squito's 
description makes it clear that there are a bunch of situations that SlaveLost 
can't handle correctly.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16626
  
**[Test build #74761 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74761/testReport)**
 for PR 16626 at commit 
[`b219178`](https://github.com/apache/spark/commit/b2191788f5d60261946da63e8c2634d5c6dfe6f5).





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17216#discussion_r106759443
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -256,6 +259,15 @@ class StreamExecution(
   updateStatusMessage("Initializing sources")
   // force initialization of the logical plan so that the sources can 
be created
   logicalPlan
+
+  // Isolated spark session to run the batches with.
+  val sparkSessionToRunBatches = sparkSession.cloneSession()
+  // Adaptive execution can change num shuffle partitions, disallow
+  
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, 
"false")
+  offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, 
batchTimestampMs = 0,
--- End diff --

Yeah, this should be kept. It should use the conf in the cloned session.





[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

2017-03-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106758290
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -61,6 +63,36 @@ abstract class SubqueryExpression(
   }
 }
 
+/**
+ * This expression is used to represent any form of subquery expression 
namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient 
expression
+ * that only lives in the scope of sameResult function call. In other 
words, analyzer,
+ * optimizer or planner never sees this expression type during 
transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends LeafExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def toString: String = 
s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include 
the
+  // sub query plan. Doing so causes issue when we canonicalize 
expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+val state = Seq(children, this.getClass.getName)
+state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --

can we write this imperatively
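
For example, a minimal sketch of my own (assuming only `children` and the class name are hashed, as in the snippet above) of the same computation written imperatively:

```scala
import java.util.Objects

object HashCodeSketch {
  // Hedged sketch: the foldLeft above unrolled into an imperative accumulation.
  def hash(children: AnyRef, className: String): Int = {
    var result = 0
    result = 31 * result + Objects.hashCode(children)
    result = 31 * result + Objects.hashCode(className)
    result
  }
}
```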





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17216#discussion_r106757213
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -256,6 +259,15 @@ class StreamExecution(
   updateStatusMessage("Initializing sources")
   // force initialization of the logical plan so that the sources can 
be created
   logicalPlan
+
+  // Isolated spark session to run the batches with.
+  val sparkSessionToRunBatches = sparkSession.cloneSession()
+  // Adaptive execution can change num shuffle partitions, disallow
+  
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, 
"false")
+  offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, 
batchTimestampMs = 0,
--- End diff --

nit: remove line.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74760 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74760/testReport)**
 for PR 17216 at commit 
[`a0c71af`](https://github.com/apache/spark/commit/a0c71af4ad2dd99815f85e3b0c842419beeea67c).




