[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

2017-03-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/17297#discussion_r106774285
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl 
private[scheduler](
   val stageTaskSets =
 taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new 
HashMap[Int, TaskSetManager])
   stageTaskSets(taskSet.stageAttemptId) = manager
-  val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
--- End diff --

Actually, that is not really the point of this check. It's just checking whether one 
stage has two task sets (a.k.a. stage attempts) where both are in the 
"non-zombie" state. It doesn't do any checks at all on which tasks are actually 
in those task sets.

This is just checking an invariant which we believe to always be true, but 
we figure it's better to fail fast if we hit this condition, rather than proceed 
with some inconsistent state. This check was added because behavior gets 
*really* confusing when the invariant is violated, and though we think it 
should always hold, we've still hit cases where it happens.
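
For reference, a minimal, self-contained sketch of the invariant being checked (a hedged illustration only; the names below are illustrative and not the actual `TaskSchedulerImpl` fields): a stage may have at most one non-zombie task set at a time, and we fail fast if a second one shows up.

```scala
// Hypothetical sketch of the "at most one non-zombie attempt per stage" invariant.
case class TaskSetSketch(stageAttemptId: Int, isZombie: Boolean)

def assertSingleActiveAttempt(
    stageId: Int,
    attempts: Map[Int, TaskSetSketch],  // keyed by stageAttemptId
    newAttemptId: Int): Unit = {
  val conflicting = attempts.exists { case (attemptId, ts) =>
    attemptId != newAttemptId && !ts.isZombie
  }
  if (conflicting) {
    // Fail fast instead of continuing to schedule with two live attempts of one stage.
    throw new IllegalStateException(s"more than one active taskSet for stage $stageId")
  }
}
```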


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16209
  
Sorry for cutting in, though. IMHO we need general logic to inject 
user-defined types via `UDTRegistration` into the DDL parser 
(`CatalystSqlParser`). If we had that logic, we could use those types in an 
expression string.
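
As a rough illustration of the idea (a hedged sketch only; `UserTypeRegistry` and `resolveTypeName` below are hypothetical names, not Spark's actual `UDTRegistration` or parser API): the DDL/type parser would consult a registry of user-defined types before falling back to the built-in ones.

```scala
import scala.collection.mutable

// Hypothetical registry mapping a user-facing type name to a UDT class name.
object UserTypeRegistry {
  private val udts = mutable.Map.empty[String, String]

  def register(typeName: String, udtClassName: String): Unit = {
    udts(typeName) = udtClassName
  }

  def lookup(typeName: String): Option[String] = udts.get(typeName)
}

// A parser hook could try the registry first when it sees an unknown type keyword
// in a DDL/expression string, e.g. "p MyPointUDT".
def resolveTypeName(name: String): String =
  UserTypeRegistry.lookup(name).getOrElse(s"builtin:$name")
```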





[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/17312
  
@rxin Thanks a lot. I added a number after `Aggregated Metrics by Executor`





[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/17312
  

![screenshot](https://cloud.githubusercontent.com/assets/4058918/24069386/0f556622-0be2-11e7-9f48-cc096cdd7d9b.png)






[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-17 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17297
  
I'm a bit confused by the description:

> 1. When a fetch failure happens, the task set manager ask the dag 
scheduler to abort all the non-running tasks. However, the running tasks in the 
task set are not killed.

This is already true. When there is a fetch failure, the [TaskSetManager 
is marked as 
zombie](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala?squery=TaskSetManager#L755),
 and the DAGScheduler resubmits stages, but nothing actively kills running 
tasks.

>  re-launches all tasks in the stage with the fetch failure that hadn't 
completed when the fetch failure occurred (the DAGScheduler re-lanches all of 
the tasks whose output data is not available -- which is equivalent to the set 
of tasks that hadn't yet completed).

I don't think it's true that it relaunches all tasks that hadn't completed 
_when the fetch failure occurred_. It relaunches all the tasks that haven't 
completed by the time the stage gets resubmitted. More tasks can complete 
between the time of the first failure and the time the stage is resubmitted.
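
A tiny sketch of that distinction (hypothetical names, not the DAGScheduler's actual bookkeeping): the set of partitions to re-run is whatever is still missing at resubmission time, not at failure time.

```scala
// Hypothetical sketch: re-run whatever has no completed attempt when the stage is
// resubmitted, so tasks that finished between the failure and the resubmission are skipped.
case class StageProgress(numPartitions: Int, completedAtResubmit: Set[Int])

def partitionsToRerun(p: StageProgress): Set[Int] =
  (0 until p.numPartitions).toSet -- p.completedAtResubmit

// e.g. 10 partitions, tasks 0 - 5 done by the time the stage is resubmitted:
assert(partitionsToRerun(StageProgress(10, Set(0, 1, 2, 3, 4, 5))) == Set(6, 7, 8, 9))
```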

But there are several other potential issues you may be trying to address.

Say there are stage 0 and stage 1, each with 10 tasks. Stage 0 completes 
fine on the first attempt, then stage 1 starts. Tasks 0 & 1 in stage 1 
complete, but then there is a fetch failure in task 2. Let's also say we have 
an abundance of cluster resources, so tasks 3 - 9 from stage 1, attempt 0 are 
still running.

Stage 0 gets resubmitted as attempt 1, just to regenerate the map output for 
whatever executor had the data for the fetch failure -- perhaps it's just one 
task from stage 0 that needs to be resubmitted. Now, lots of different scenarios 
are possible:

(a) Tasks 3 - 9 from stage 1 attempt 0 all finish successfully while stage 
0 attempt 1 is running. So when stage 0 attempt 1 finishes, stage 1 
attempt 1 is submitted with just task 2. If it completes successfully, we're 
done (no wasted work).

(b) Stage 0 attempt 1 finishes before tasks 3 - 9 from stage 1 attempt 0 
have finished. So stage 1 gets submitted again as stage 1 attempt 1, with 
tasks 2 - 9, and there are now two copies running for each of tasks 3 - 9. Maybe 
all the tasks from attempt 0 actually finish shortly after attempt 1 starts. In 
that case, the stage is complete as soon as there is one completed attempt for 
each task, but even after the stage completes successfully, all the other 
tasks keep running anyway. (Plenty of wasted work.)

(c) Like (b), but shortly after stage 1 attempt 1 is submitted, we get 
another fetch failure in one of the old "zombie" tasks from stage 1 attempt 0.  
But the [DAGScheduler realizes it already has a more recent attempt for this 
stage](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1268),
 so it ignores the fetch failure (see the sketch after these scenarios). All the 
other tasks keep running as usual. If there aren't any other issues, the stage 
completes when there is one completed attempt for each task. (Same amount of 
wasted work as (b).)

(d) While stage 0 attempt 1 is running, we get another fetch failure from 
stage 1 attempt 0, say in task 3, which reports a failure from a *different 
executor*. Maybe it's from a completely different host (just by chance, or 
there may be cluster maintenance where multiple hosts are serviced at once); or 
maybe it's from another executor on the same host (at least until we do 
something about your other PR on unregistering all shuffle files on a host).  
To be honest, I don't understand how things work in this scenario. We [mark 
stage 0 as 
failed](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1303),
 we [unregister some shuffle 
output](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1328),
 and [we resubmit stage 
0](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1319).  But stage 0 
attempt 1 is still running, so I would have expected us to end up with 
conflicting task sets. Whatever the real behavior is here, it seems we're at 
risk of even more duplicated work for yet another attempt of stage 1.

etc.
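
To make the behavior in (c) concrete, here is a minimal, self-contained sketch of the stale-attempt check (hedged, with illustrative names; this is not the DAGScheduler's actual code): a fetch failure reported by a task from an older, zombie stage attempt is simply ignored because a newer attempt is already tracked.

```scala
// Hypothetical sketch: only react to a fetch failure if it comes from the most
// recent attempt of the failed stage; otherwise it is stale and ignored.
case class FetchFailure(stageId: Int, stageAttemptId: Int)
case class TrackedStage(stageId: Int, latestAttemptId: Int)

def shouldHandle(failure: FetchFailure, stage: TrackedStage): Boolean =
  failure.stageId == stage.stageId && failure.stageAttemptId == stage.latestAttemptId

// A failure from attempt 0 arriving while attempt 1 is the latest is ignored:
assert(!shouldHandle(FetchFailure(1, 0), TrackedStage(1, 1)))
```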

So I think in (b) and (c), you are trying to avoid resubmitting tasks 3 - 9 
in stage 1 attempt 1. The thing is, there is a strong reason to believe that 
the original versions of those tasks will fail. Most likely, those tasks need 
map output from the same executor that caused the first fetch failure.  So Kay 
is 

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106774155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
--- End diff --

why the return type is `Seq[Seq[LogicalPlan]]`?





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106774124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774067
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -394,6 +394,68 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assertDataStructuresEmpty()
   }
 
+  test("All shuffle files should on the slave should be cleaned up when 
slave lost") {
+// reset the test context with the right shuffle service config
+afterEach()
+val conf = new SparkConf()
+conf.set("spark.shuffle.service.enabled", "true")
+init(conf)
+runEvent(ExecutorAdded("exec-hostA1", "hostA"))
+runEvent(ExecutorAdded("exec-hostA2", "hostA"))
+runEvent(ExecutorAdded("exec-hostB", "hostB"))
+val firstRDD = new MyRDD(sc, 3, Nil)
+val firstShuffleDep = new ShuffleDependency(firstRDD, new 
HashPartitioner(2))
+val firstShuffleId = firstShuffleDep.shuffleId
+val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
--- End diff --

You are right, it was confusing before. Changed accordingly.





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774060
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
+}
+  }
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnHost(
+  execId: String,
+  host: String,
+  maybeEpoch: Option[Long] = None) {
+val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+if (!failedEpoch.contains(execId) || failedEpoch(execId) < 
currentEpoch) {
+  failedEpoch(execId) = currentEpoch
+  logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+  blockManagerMaster.removeExecutor(execId)
+  for ((shuffleId, stage) <- shuffleIdToMapStage) {
+logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, 
currentEpoch))
+stage.removeOutputsOnHost(host)
+mapOutputTracker.registerMapOutputs(
+  shuffleId,
+  stage.outputLocInMapOutputTrackerFormat(),
+  changeEpoch = true)
+  }
+  if (shuffleIdToMapStage.isEmpty) {
+mapOutputTracker.incrementEpoch()
+  }
+  clearCacheLocs()
+} else {
+  logDebug("Additional executor lost message for " + execId +
+"(epoch " + currentEpoch + ")")
--- End diff --

Made changes as suggested, thanks!





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774040
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1365,18 +1369,25 @@ class DAGScheduler(
*/
   private[scheduler] def handleExecutorLost(
   execId: String,
-  filesLost: Boolean,
+  fileLost: Boolean) {
+removeExecutorAndUnregisterOutputOnExecutor(execId,
+  fileLost || !env.blockManager.externalShuffleServiceEnabled, None)
+  }
+
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnExecutor(
+  execId: String,
+  fileLost: Boolean,
   maybeEpoch: Option[Long] = None) {
 val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
 if (!failedEpoch.contains(execId) || failedEpoch(execId) < 
currentEpoch) {
   failedEpoch(execId) = currentEpoch
   logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
   blockManagerMaster.removeExecutor(execId)
-
-  if (filesLost || !env.blockManager.externalShuffleServiceEnabled) {
-logInfo("Shuffle files lost for executor: %s (epoch 
%d)".format(execId, currentEpoch))
+  if (fileLost) {
 // TODO: This will be really slow if we keep accumulating shuffle 
map stages
 for ((shuffleId, stage) <- shuffleIdToMapStage) {
+  logInfo("Shuffle files lost for executor: %s (epoch 
%d)".format(execId, currentEpoch))
--- End diff --

Ah, my bad, thanks for noticing. 





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774047
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
+}
+  }
+
+  private[scheduler] def removeExecutorAndUnregisterOutputOnHost(
+  execId: String,
+  host: String,
+  maybeEpoch: Option[Long] = None) {
--- End diff --

done





[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/17088#discussion_r106774045
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1390,7 +1401,34 @@ class DAGScheduler(
   }
 } else {
   logDebug("Additional executor lost message for " + execId +
-   "(epoch " + currentEpoch + ")")
+"(epoch " + currentEpoch + ")")
--- End diff --

done.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773943
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773716
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -83,9 +411,19 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-case j @ ExtractFiltersAndInnerJoins(input, conditions)
+case ExtractFiltersAndInnerJoins(input, conditions)
 if input.size > 2 && conditions.nonEmpty =>
-  createOrderedJoin(input, conditions)
+  if (conf.starSchemaDetection && !conf.cboEnabled) {
--- End diff --

I don't think this algorithm conflicts with CBO, though it conflicts with 
the other join reordering rule.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773686
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773624
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106773584
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark issue #17312: [SPARK-19973] Display num of executors for the stage.

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17312
  
Can you put a screenshot here? Might actually be useful to have.






[GitHub] spark issue #17318: [SPARK-19896][SQL] Throw an exception if case classes ha...

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17318
  
Can you put the "after" exception in the PR description as well?






[GitHub] spark pull request #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17337





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17337
  
Merging in master/branch-2.1.






[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773079
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = {
+plan match {
+  case Join(left, right, _: InnerLike, cond) =>
+val (leftPlans, leftConditions) = extractInnerJoins(left)
+val (rightPlans, rightConditions) = extractInnerJoins(right)
+(leftPlans ++ rightPlans, 
cond.toSet.flatMap(splitConjunctivePredicates) ++
+  leftConditions ++ rightConditions)
+  case Project(projectList, join) if 
projectList.forall(_.isInstanceOf[Attribute]) =>
+extractInnerJoins(join)
+  case _ =>
+(Seq(plan), Set())
+}
+  }
+
+  private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = 
plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
--- End diff --

Nit: `cond` -> `_`





[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
--- End diff --

Nit: `items` -> `subplans`?





[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773013
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
--- End diff --

How about

`Extracts the join conditions and sub-plans of consecutive inner joins`





[GitHub] spark issue #17333: [SPARK-19997] [SQL]fix proxy ugi could not get tgt to ca...

2017-03-17 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/17333
  
@jerryshao 😂





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772868
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r106772853
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -36,27 +36,24 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
 if (!conf.cboEnabled || !conf.joinReorderEnabled) {
   plan
 } else {
-  val result = plan transform {
-case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
-  reorder(p, p.outputSet)
-case j @ Join(_, _, _: InnerLike, _) =>
-  reorder(j, j.outputSet)
+  val result = plan transformDown {
+case j @ Join(_, _, _: InnerLike, _) => reorder(j)
   }
   // After reordering is finished, convert OrderedJoin back to Join
-  result transform {
+  result transformDown {
 case oj: OrderedJoin => oj.join
   }
 }
   }
 
-  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+  def reorder(plan: LogicalPlan): LogicalPlan = {
--- End diff --

private





[GitHub] spark issue #17338: [SPARK-19990][SQL]create a temp file for file in test.ja...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17338
  
**[Test build #74769 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74769/testReport)**
 for PR 17338 at commit 
[`09b5e40`](https://github.com/apache/spark/commit/09b5e4065d23fbc8ddc64bc1dc92557648d03818).





[GitHub] spark pull request #17338: [SPARK-19990][SQL]create a temp file for file in ...

2017-03-17 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17338

[SPARK-19990][SQL]create a temp file for file in test.jar's resource when 
run test accross different modules

## What changes were proposed in this pull request?

After merging `HiveDDLSuite` and `DDLSuite` in [SPARK-19235](https://issues.apache.org/jira/browse/SPARK-19235), we have two subclasses of `DDLSuite`, namely `HiveCatalogedDDLSuite` and `InMemoryCatalogDDLSuite`.

While `DDLSuite` is in the `sql/core` module and `HiveCatalogedDDLSuite` is in the `sql/hive` module, running `HiveCatalogedDDLSuite` also runs the tests defined in its parent class `DDLSuite`. Some of those test cases fail because they look up test files in the `sql/core` module's `resource` directory.

The test file path obtained there starts with 'jar:', e.g. "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to `new Path()` in datasource.scala.

This PR fixes this by copying the file from the resource into a temp dir.
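
A minimal sketch of the copying step (the helper name and details below are illustrative, not the exact code in this PR): read the resource from the classpath and write it to a temporary file, so the test can hand a plain local path to `new Path(...)` regardless of whether the resource lives in a jar.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical helper: copy a classpath resource (possibly inside a jar) to a temp file.
def resourceToTempFile(resource: String): String = {
  val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)
  require(in != null, s"resource not found: $resource")
  val tempFile = File.createTempFile("test-resource-", "-" + new File(resource).getName)
  tempFile.deleteOnExit()
  try {
    Files.copy(in, tempFile.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }
  tempFile.getAbsolutePath
}

// e.g. resourceToTempFile("test-data/cars.csv") returns a usable local file path
```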

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark fixtestfailemvn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17338.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17338


commit 64c6b223840307506468430ed5b50d24bf76218d
Author: windpiger 
Date:   2017-03-18T04:07:25Z

[SPARK-19990][SQL]create a temp file for file in test.jar's resource when 
run test accross different modules

commit 09b5e4065d23fbc8ddc64bc1dc92557648d03818
Author: windpiger 
Date:   2017-03-18T04:17:15Z

fix a comment







[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17112
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74765/
Test FAILed.





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17112
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17112
  
**[Test build #74765 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74765/testReport)**
 for PR 17112 at commit 
[`cfc7e33`](https://github.com/apache/spark/commit/cfc7e331356b2296a66273e8d059635ba1c7991b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17191
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74764/
Test PASSed.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17191
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17191
  
**[Test build #74764 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74764/testReport)**
 for PR 17191 at commit 
[`f982f30`](https://github.com/apache/spark/commit/f982f30b22c827e1f8058522b5f6d05f41735ff2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17219
  
**[Test build #74767 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74767/testReport)**
 for PR 17219 at commit 
[`2989fad`](https://github.com/apache/spark/commit/2989fadd54156f7483822eadb591736370301c16).





[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17088
  
**[Test build #74768 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)**
 for PR 17088 at commit 
[`9f64e29`](https://github.com/apache/spark/commit/9f64e2931eabd2fcc5909123e73c9c046caceb3b).





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772404
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772374
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
+case PhysicalOperation(_, _, t: LeafNode) if 
t.stats(conf).rowCount.isDefined => true
+case _ => false
+  }
+
+  if (!foundEligibleJoin) {
+// Some plans don't have stats or are complex plans. 
Conservatively,
+// return an empty star join. This restriction can be lifted
+// once statistics are propagated in the plan.
+emptyStarJoinPlan
+  } else {
+// Find the fact table using cardinality based heuristics i.e.
+// the table with the largest number of rows.
+val sortedFactTables = input.map { plan =>
+

[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17219
  
**[Test build #74766 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74766/testReport)**
 for PR 17219 at commit 
[`5d2ba62`](https://github.com/apache/spark/commit/5d2ba62f70442de0a4eeedc248d5e9ad3a611f2a).





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17192
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74763/
Test PASSed.





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
+  // Find if the input plans are eligible for star join detection.
+  // An eligible plan is a base table access with valid statistics.
+  val foundEligibleJoin = input.forall {
--- End diff --

I think we can just write `input.forall(_.stats(conf).rowCount.isDefined)`, since both `Filter` and `Project` can propagate the row count



[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17192
  
**[Test build #74763 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74763/testReport)**
 for PR 17192 at commit 
[`185ea60`](https://github.com/apache/spark/commit/185ea6003d60feed20c56de61c17bc304663d99a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class JsonToStructs(`
  * `case class StructsToJson(`





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106772029
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,340 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{BaseTableAccess, 
ExtractFiltersAndInnerJoins}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class DetectStarSchemaJoin(conf: CatalystConf) extends 
PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
--- End diff --

so this is a limitation of this algorithm, not a limitation of 
catalyst/spark sql, right?





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106771968
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SimpleCatalystConf.scala
 ---
@@ -40,6 +40,9 @@ case class SimpleCatalystConf(
 override val cboEnabled: Boolean = false,
 override val joinReorderEnabled: Boolean = false,
 override val joinReorderDPThreshold: Int = 12,
+override val starSchemaDetection: Boolean = false,
+override val starSchemaFTRatio: Double = 0.9,
+override val ndvMaxError: Double = 0.05,
--- End diff --

these are not needed anymore, `SQLConf` is in catalyst module now.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17337
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74762/
Test PASSed.





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17337
  
**[Test build #74762 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74762/testReport)**
 for PR 17337 at commit 
[`4480e72`](https://github.com/apache/spark/commit/4480e72cb789212c2cc86cea50c61c20d3faa2d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17216
  
Seems like an unrelated failure. Probably a flaky test.







[GitHub] spark issue #17164: [SPARK-16844][SQL] Support codegen for sort-based aggrea...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17164
  
@hvanhovell ping





[GitHub] spark issue #17318: [SPARK-19896][SQL] Throw an exception if case classes ha...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17318
  
@cloud-fan ping





[GitHub] spark pull request #17191: [SPARK-14471][SQL] Aliases in SELECT could be use...

2017-03-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17191#discussion_r106770623
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2598,4 +2598,26 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
 assert(!jobStarted.get(), "Command should not trigger a Spark job.")
   }
+
+  test("SPARK-14471 When groupByAliasesEnabled=true, aliases in SELECT 
could exist in GROUP BY") {
+withSQLConf(SQLConf.GROUP_BY_ALIASES_ENABLED.key -> "true") {
--- End diff --

Updated





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17191
  
@gatorsmile I checked, and actually MySQL and PostgreSQL cannot use numbers as alias names:

```
// PostgreSQL
postgres=# SELECT gkey1 AS 1 FROM t2;
ERROR:  syntax error at or near "1"
LINE 1: SELECT gkey1 AS 1 FROM t2;

// MySQL
mysql> select gkey1 AS 1 from t2;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual 
that corresponds to your MySQL server version for the right syntax to use near 
'1 from t2' at line 1
mysql> 
```

In current master, Spark has the same behaviour; the parser throws an exception in this case:
```
scala> sql("SELECT key1 AS 1 FROM t").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '1' expecting {, ',', 'FROM', 'WHERE', 'GROUP', 
'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 15)

== SQL ==
SELECT key1 AS 1 FROM t
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:205)
```





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-17 Thread yanji84
Github user yanji84 commented on the issue:

https://github.com/apache/spark/pull/17109
  
hello, any update on this?





[GitHub] spark issue #17112: [WIP] Measurement for SPARK-16929.

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17112
  
**[Test build #74765 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74765/testReport)**
 for PR 17112 at commit 
[`cfc7e33`](https://github.com/apache/spark/commit/cfc7e331356b2296a66273e8d059635ba1c7991b).





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17191
  
**[Test build #74764 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74764/testReport)**
 for PR 17191 at commit 
[`f982f30`](https://github.com/apache/spark/commit/f982f30b22c827e1f8058522b5f6d05f41735ff2).





[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-17 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16867
  
@squito 
Sure. I did a test with 100k tasks. The results are as below:

| operation | time cost |
| -- | -- |
| insert | 135ms, 122ms, 119ms, 120ms, 163ms |
| `checkSpeculatableTasks` | 6ms, 6ms, 6ms, 5ms, 6ms |
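
For reference, a rough sketch of how such a micro-benchmark can be set up (the data structure and durations below are illustrative, not the exact code used for these numbers):

```scala
import scala.collection.mutable

// Time a block and report milliseconds.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val rng = new scala.util.Random(42)
// Ordered set of (duration, taskId) pairs, standing in for the structure
// that tracks successful task durations.
val durations = mutable.TreeSet.empty[(Long, Int)]

time("insert") {
  (0 until 100000).foreach(i => durations += ((rng.nextInt(10000).toLong, i)))
}

time("checkSpeculatableTasks") {
  // e.g. find the median successful duration and count tasks slower than it.
  val median = durations.toIndexedSeq(durations.size / 2)._1
  durations.count(_._1 > median)
}
```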





[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

2017-03-17 Thread ioana-delaney
Github user ioana-delaney commented on the issue:

https://github.com/apache/spark/pull/17286
  
@wzhfy Some thoughts on how to solve the Cartesian problem as part of the join enumeration algorithm: apply a strategy similar to the one we discussed for star-plans. You keep track of "connected" tables and "unconnected" tables. During join enumeration, mixed combinations are pruned until the plan for the "connected" set of tables is built. Then, we add tables from the "unconnected" set - maybe only as left-deep trees (i.e. the size of the inner is one). Also, knowing that a set of tables is connected through join conditions allows further plan pruning based on the presence of join predicates. Integration with star-join, and probably other heuristics, would require introducing some filtering/pruning strategies on top of the search engine. Just some thoughts…
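
A rough sketch of the pruning idea described above, under the assumption that the enumerator considers candidate (left, right) plan pairs over sets of table names (the names and shape below are illustrative, not the PR's API):

```scala
// Returns true if a candidate (left, right) pair should be skipped:
// before the plan covering all "connected" tables is built, any pair that
// mixes connected and unconnected tables is pruned.
def shouldPrune(
    left: Set[String],
    right: Set[String],
    connected: Set[String],
    connectedPlanBuilt: Boolean): Boolean = {
  val tables = left ++ right
  val touchesConnected = tables.exists(connected.contains)
  val touchesUnconnected = tables.exists(t => !connected.contains(t))
  !connectedPlanBuilt && touchesConnected && touchesUnconnected
}

// Example: with connected = Set("f", "d1", "d2") and an unconnected table "u",
// the pair ({f, d1}, {u}) is pruned until the plan over {f, d1, d2} exists.
```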





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17191
  
aha, I'll check and just a sec.





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17192
  
(I just rebased to resolve the conflicts)





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17192
  
**[Test build #74763 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74763/testReport)**
 for PR 17192 at commit 
[`185ea60`](https://github.com/apache/spark/commit/185ea6003d60feed20c56de61c17bc304663d99a).





[GitHub] spark issue #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17337
  
**[Test build #74762 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74762/testReport)**
 for PR 17337 at commit 
[`4480e72`](https://github.com/apache/spark/commit/4480e72cb789212c2cc86cea50c61c20d3faa2d1).





[GitHub] spark pull request #17337: [SQL][MINOR] Fix scaladoc for UDFRegistration

2017-03-17 Thread jaceklaskowski
GitHub user jaceklaskowski opened a pull request:

https://github.com/apache/spark/pull/17337

[SQL][MINOR] Fix scaladoc for UDFRegistration

## What changes were proposed in this pull request?

Fix scaladoc for UDFRegistration

## How was this patch tested?

local build

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaceklaskowski/spark udfregistration-scaladoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17337


commit 4480e72cb789212c2cc86cea50c61c20d3faa2d1
Author: Jacek Laskowski 
Date:   2017-03-18T01:15:50Z

[SQL][MINOR] Fix scaladoc for UDFRegistration







[GitHub] spark issue #17333: [SPARK-19997] [SQL]fix proxy ugi could not get tgt to ca...

2017-03-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17333
  
What a coincidence 😄!





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74761/
Test PASSed.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16626
  
**[Test build #74761 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74761/testReport)**
 for PR 16626 at commit 
[`b219178`](https://github.com/apache/spark/commit/b2191788f5d60261946da63e8c2634d5c6dfe6f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106768186
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class FailureSafeParser[IN](
+func: IN => Seq[InternalRow],
+mode: String,
+schema: StructType,
+columnNameOfCorruptRecord: String) {
+
+  private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
+  private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
+  private val resultRow = new GenericInternalRow(schema.length)
+
+  private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
+if (corruptFieldIndex.isDefined) {
+  (row, badRecord) => {
+for ((f, i) <- actualSchema.zipWithIndex) {
+  resultRow(schema.fieldIndex(f.name)) = row.map(_.get(i, 
f.dataType)).orNull
+}
+resultRow(corruptFieldIndex.get) = badRecord()
+resultRow
+  }
+} else {
+  (row, badRecord) => row.getOrElse {
+for (i <- schema.indices) resultRow.setNullAt(i)
--- End diff --

I ran some tests with the code below to help.

```scala
object ForWhile {
  def forloop = {
val l = Array[Int](1,2,3)
for (i <- l) {
}
  }

  def whileloop = {
val arr = Array[Int](1,2,3)
var idx = 0
while(idx < arr.length) {
  idx += 1
}
  }
}
```

```
Compiled from "ForWhile.scala"
public final class ForWhile {
  public static void whileloop();
Code:
   0: getstatic #16 // Field 
ForWhile$.MODULE$:LForWhile$;
   3: invokevirtual #18 // Method 
ForWhile$.whileloop:()V
   6: return

  public static void forloop();
Code:
   0: getstatic #16 // Field 
ForWhile$.MODULE$:LForWhile$;
   3: invokevirtual #21 // Method ForWhile$.forloop:()V
   6: return
}

Compiled from "ForWhile.scala"
public final class ForWhile$ {
  public static final ForWhile$ MODULE$;

  public static {};
Code:
   0: new   #2  // class ForWhile$
   3: invokespecial #12 // Method "":()V
   6: return

  public void forloop();
Code:
   0: getstatic #18 // Field 
scala/Array$.MODULE$:Lscala/Array$;
   3: getstatic #23 // Field 
scala/Predef$.MODULE$:Lscala/Predef$;
   6: iconst_3
   7: newarray   int
   9: dup
  10: iconst_0
  11: iconst_1
  12: iastore
  13: dup
  14: iconst_1
  15: iconst_2
  16: iastore
  17: dup
  18: iconst_2
  19: iconst_3
  20: iastore
  21: invokevirtual #27 // Method 
scala/Predef$.wrapIntArray:([I)Lscala/collection/mutable/WrappedArray;
  24: getstatic #32 // Field 
scala/reflect/ClassTag$.MODULE$:Lscala/reflect/ClassTag$;
  27: invokevirtual #36 // Method 
scala/reflect/ClassTag$.Int:()Lscala/reflect/ClassTag;
  30: invokevirtual #40 // Method 
scala/Array$.apply:(Lscala/collection/Seq;Lscala/reflect/ClassTag;)Ljava/lang/Object;
  33: checkcast #42 // class "[I"
  36: astore_1
  37: getstatic #23 // Field 
scala/Predef$.MODULE$:Lscala/Predef$;
  40: aload_1
  41: 

[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106767486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class FailureSafeParser[IN](
+func: IN => Seq[InternalRow],
+mode: String,
+schema: StructType,
+columnNameOfCorruptRecord: String) {
+
+  private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
+  private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
+  private val resultRow = new GenericInternalRow(schema.length)
+
+  private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
+if (corruptFieldIndex.isDefined) {
+  (row, badRecord) => {
+for ((f, i) <- actualSchema.zipWithIndex) {
+  resultRow(schema.fieldIndex(f.name)) = row.map(_.get(i, 
f.dataType)).orNull
+}
+resultRow(corruptFieldIndex.get) = badRecord()
+resultRow
+  }
+} else {
+  (row, badRecord) => row.getOrElse {
+for (i <- schema.indices) resultRow.setNullAt(i)
--- End diff --

I ran some tests. I can see boxing/unboxing that does not seem to exist in the while loop.

```scala
object ForWhile {
  def forloop = {
val l = Seq(1,2,3).indices
for (i <- l) println(i)
  }
}
```

```
2: invokestatic  #46 // Method 
scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
```

```
4: invokestatic  #37 // Method 
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
```
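
For comparison, here is a minimal, self-contained sketch of my own (not code from the PR) of the same iteration written as a while loop over a primitive index; compiled the same way, it emits no `BoxesRunTime.boxToInteger`/`unboxToInt` calls:

```scala
object ForWhile {
  // Hedged sketch for comparison only: iterate over the indices with a
  // primitive Int counter instead of a for-comprehension, so no Range
  // iterator is created and no boxing/unboxing is emitted.
  def whileloop = {
    val l = Seq(1, 2, 3).indices
    var i = 0
    while (i < l.length) {
      println(i)
      i += 1
    }
  }
}
```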







[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106767465
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
--- End diff --

@ron8hu Method ```StarSchemaDetection.findStarJoins()``` only finds the set of star-join plans. It doesn't reorder the plans.

The star-based reordering call from the ```ReorderJoin``` rule is disabled when CBO is turned on. If both are enabled, cost-based reordering will do the final reordering, but I blocked star plans just in case.




[GitHub] spark issue #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message for ver...

2017-03-17 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/17327
  
Thanks! Closed this.





[GitHub] spark pull request #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message ...

2017-03-17 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/17327





[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-17 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106766281
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer
 import scala.annotation.tailrec
 
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import 
org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, 
PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper 
{
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number 
of dimension
+   * tables. In general, star-schema joins are detected using the 
following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *+ Dimension contains a primary key that is being joined to the 
fact table.
+   *+ Fact table contains foreign keys referencing multiple dimension 
tables.
+   *  2. Cardinality based heuristics
+   *+ Usually, the table with the highest cardinality is the fact 
table.
+   *+ Table being joined with the most number of tables is the fact 
table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above 
two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the 
dimension
+   * tables are chosen based on the RI constraints. A star join will 
consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To 
detect that a
+   * column is a primary key, the algorithm uses table and column 
statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm 
currently returns only
+   * the star join with the largest fact table. Choosing the largest fact 
table on the
+   * driving arm to avoid large inners is in general a good heuristic. 
This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if 
they are eligible
+   * for star join detection. An eligible plan is a base table access with 
valid statistics.
+   * A base table access represents Project or Filter operators above a 
LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join 
since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not 
available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, 
it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table 
with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact 
table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer 
column uniqueness,
+   * the algorithm compares the number of distinct values with the total 
number of rows in the
+   * table. If their relative difference is within certain limits (i.e. 
ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+  input: Seq[LogicalPlan],
+  conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+if (!conf.starSchemaDetection || input.size < 2) {
+  emptyStarJoinPlan
+} else {
--- End diff --

} else if (conf.joinReorderEnabled) {
  emptyStarJoinPlan
} else {

When both configuration parameters joinReorderEnabled and starSchemaDetection are true, we want to avoid performing join reordering twice. There is no added value in running the reordering a second time.
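
For illustration, here is a self-contained sketch of the control flow being proposed, with plain booleans standing in for the `starSchemaDetection` and `joinReorderEnabled` conf flags (an assumption-laden illustration of the idea, not Spark's actual code):

```scala
object StarJoinGuardSketch {
  // Hedged sketch: star-join detection is skipped when detection is off,
  // when there is nothing to reorder, or when cost-based join reordering is
  // already enabled, so the reordering work is never done twice.
  def shouldDetectStarJoins(
      starSchemaDetection: Boolean,
      joinReorderEnabled: Boolean,
      numInputPlans: Int): Boolean = {
    if (!starSchemaDetection || numInputPlans < 2) {
      false
    } else if (joinReorderEnabled) {
      false // CBO join reordering will take care of it
    } else {
      true
    }
  }
}
```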





[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-17 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106765877
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -296,6 +297,45 @@ class SessionCatalog(
   }
 
   /**
+   * Alter the schema of a table identified by the provided table 
identifier to add new columns
+   * @param identifier TableIdentifier
+   * @param columns new columns
+   * @param caseSensitive enforce case sensitivity for column names
+   */
+  def alterTableAddColumns(
--- End diff --

I will change back to use `alterTableSchema` to make it more generic for 
other schema evolution features. 
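
For illustration, a rough sketch of what a more generic entry point could look like; the object name, method body, and signature below are hypothetical and may differ from the PR's final API:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.types.StructType

object AlterTableSchemaSketch {
  // Hypothetical sketch only: a single schema-evolution entry point that
  // takes the full new schema rather than just the added columns.
  def alterTableSchema(identifier: TableIdentifier, newSchema: StructType): Unit = {
    // 1. look up the existing table metadata in the catalog
    // 2. validate that newSchema is a legal evolution of the current schema
    // 3. persist the updated schema through the external catalog
  }
}
```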





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14615
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14615
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74759/
Test PASSed.





[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14615
  
**[Test build #74759 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74759/testReport)**
 for PR 14615 at commit 
[`0043190`](https://github.com/apache/spark/commit/004319080bd7194aba546aca540589c2eade76f1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17326: [SPARK-19985][ML] Fixed copy method for some ML M...

2017-03-17 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17326#discussion_r106765091
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -74,6 +74,7 @@ class MultilayerPerceptronClassifierSuite
   .setMaxIter(100)
   .setSolver("l-bfgs")
 val model = trainer.fit(dataset)
+MLTestingUtils.checkCopy(model)
--- End diff --

Just something for consideration. I am not sure this is the best way to add a unit test for this. I don't think we need to add the check only for the classes included in this PR. And even if we do, maybe it's possible to do it in a more uniform way, for example in `testEstimatorAndModelReadWrite`, or by adding a `pipelineCompatibilityTest` as necessary.
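
For context, a rough sketch of my own (possibly differing from what `MLTestingUtils.checkCopy` actually does) of what such a copy check typically verifies:

```scala
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap

object CopyCheckSketch {
  // Hedged sketch of a generic copy check: copying a fitted model with an
  // empty ParamMap should yield a model of the same type with the same parent.
  def checkCopy(model: Model[_]): Unit = {
    val copied = model.copy(ParamMap.empty).asInstanceOf[Model[_]]
    assert(copied.getClass == model.getClass)
    assert(copied.parent == model.parent)
  }
}
```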





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-03-17 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17014
  
Hi @zhengruifeng, is there any update?





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
What are the behaviors of MySQL and Postgres when we use digits as aliases?
```SQL
SELECT k1 AS `2`, k2 AS a, SUM(v) FROM t GROUP BY 2, k2
```

You might need to replace the backticks with other symbols for quoting the column names.





[GitHub] spark issue #17179: [SPARK-19067][SS] Processing-time-based timeout in MapGr...

2017-03-17 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17179
  
LGTM.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74760/
Test FAILed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74760 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74760/testReport)**
 for PR 17216 at commit 
[`a0c71af`](https://github.com/apache/spark/commit/a0c71af4ad2dd99815f85e3b0c842419beeea67c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17216





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17216
  
LGTM. Merging it to master.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74758/
Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74758 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74758/testReport)**
 for PR 17216 at commit 
[`3abe0a0`](https://github.com/apache/spark/commit/3abe0a0fcb60def5eb14697fa6f78eee318b77b0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74757/
Test PASSed.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74757 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74757/testReport)**
 for PR 17216 at commit 
[`a2b32ce`](https://github.com/apache/spark/commit/a2b32ce3ee536d1ea1d12ae4dcc3c561788e3dd2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

2017-03-17 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106761010
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -61,6 +63,36 @@ abstract class SubqueryExpression(
   }
 }
 
+/**
+ * This expression is used to represent any form of subquery expression 
namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient 
expression
+ * that only lives in the scope of sameResult function call. In other 
words, analyzer,
+ * optimizer or planner never sees this expression type during 
transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends LeafExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def toString: String = 
s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include 
the
+  // sub query plan. Doing so causes issue when we canonicalize 
expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+val state = Seq(children, this.getClass.getName)
+state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --

@rxin Sure Reynold.





[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/17088
  
Ok that makes sense. I wanted to make sure that there wasn't some bug in 
SlaveLost (which might lead to a simpler fix than this) but @squito's 
description makes it clear that there are a bunch of situations that SlaveLost 
can't handle correctly.





[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16626
  
**[Test build #74761 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74761/testReport)**
 for PR 16626 at commit 
[`b219178`](https://github.com/apache/spark/commit/b2191788f5d60261946da63e8c2634d5c6dfe6f5).





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17216#discussion_r106759443
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -256,6 +259,15 @@ class StreamExecution(
   updateStatusMessage("Initializing sources")
   // force initialization of the logical plan so that the sources can 
be created
   logicalPlan
+
+  // Isolated spark session to run the batches with.
+  val sparkSessionToRunBatches = sparkSession.cloneSession()
+  // Adaptive execution can change num shuffle partitions, disallow
+  
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, 
"false")
+  offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, 
batchTimestampMs = 0,
--- End diff --

Yeah, this should be kept. It should use the conf in the cloned session.





[GitHub] spark pull request #17330: [SPARK-19993][SQL][WIP] Caching logical plans con...

2017-03-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106758290
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -61,6 +63,36 @@ abstract class SubqueryExpression(
   }
 }
 
+/**
+ * This expression is used to represent any form of subquery expression 
namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient 
expression
+ * that only lives in the scope of sameResult function call. In other 
words, analyzer,
+ * optimizer or planner never sees this expression type during 
transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends LeafExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def toString: String = 
s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include 
the
+  // sub query plan. Doing so causes issue when we canonicalize 
expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+val state = Seq(children, this.getClass.getName)
+state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --

can we write this imperatively
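
For example, a minimal sketch of my own (assuming only `children` and the class name are hashed, as in the snippet above) of the same computation written imperatively:

```scala
import java.util.Objects

object HashCodeSketch {
  // Hedged sketch: the foldLeft above unrolled into an imperative accumulation.
  def hash(children: AnyRef, className: String): Int = {
    var result = 0
    result = 31 * result + Objects.hashCode(children)
    result = 31 * result + Objects.hashCode(className)
    result
  }
}
```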





[GitHub] spark pull request #17216: [SPARK-19873][SS] Record num shuffle partitions i...

2017-03-17 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17216#discussion_r106757213
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -256,6 +259,15 @@ class StreamExecution(
   updateStatusMessage("Initializing sources")
   // force initialization of the logical plan so that the sources can 
be created
   logicalPlan
+
+  // Isolated spark session to run the batches with.
+  val sparkSessionToRunBatches = sparkSession.cloneSession()
+  // Adaptive execution can change num shuffle partitions, disallow
+  
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, 
"false")
+  offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, 
batchTimestampMs = 0,
--- End diff --

nit: remove line.





[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17216
  
**[Test build #74760 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74760/testReport)**
 for PR 17216 at commit 
[`a0c71af`](https://github.com/apache/spark/commit/a0c71af4ad2dd99815f85e3b0c842419beeea67c).




