[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110699779
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
--- End diff --

Copying docs from the Scala docs directly could be confusing, since we won't 
support this in 2.0 and 2.1, and changes since 2.0 don't really affect us 
here. 
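
For context, a minimal sketch of the Scala DataFrameWriter API that the
Python wrapper in this PR mirrors; the session setup and table name here
are illustrative, not taken from the PR:

    import org.apache.spark.sql.SparkSession

    object BucketedWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("bucketed-write-sketch")
          .enableHiveSupport() // bucketed saveAsTable assumes a persistent catalog
          .getOrCreate()
        import spark.implicits._

        val df = Seq((1, "foo", 3.0), (2, "foo", 5.0), (3, "bar", -1.0)).toDF("x", "y", "z")

        // 3 buckets on "x", rows sorted within each bucket by "z"
        df.write.bucketBy(3, "x").sortBy("z").mode("overwrite").saveAsTable("bucket_sketch")
        spark.stop()
      }
    }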





[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-04-10 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16774#discussion_r110717717
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +108,60 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val eval = $(evaluator)
     val epm = $(estimatorParamMaps)
     val numModels = epm.length
-    val metrics = new Array[Double](epm.length)
+
+    // Create execution context, run in serial if numParallelEval is 1
+    val executionContext = $(numParallelEval) match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(executorServiceFactory(n))
+    }
 
     val instr = Instrumentation.create(this, dataset)
     instr.logParams(numFolds, seed)
     logTuningParams(instr)
 
+    // Compute metrics for each model over each split
+    logDebug(s"Running cross-validation with level of parallelism: $numParallelEval.")
     val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
-    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
+    val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
       val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
-      // multi-model training
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
-      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
-      trainingDataset.unpersist()
-      var i = 0
-      while (i < numModels) {
-        // TODO: duplicate evaluator to take extra params from input
-        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
-        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
-        metrics(i) += metric
-        i += 1
+
+      // Fit models in a Future with thread-pool size determined by '$numParallelEval'
+      val models = epm.map { paramMap =>
+        Future[Model[_]] {
+          val model = est.fit(trainingDataset, paramMap)
+          model.asInstanceOf[Model[_]]
+        } (executionContext)
       }
+
+      Future.sequence[Model[_], Iterable](models)(implicitly, executionContext).onComplete { _ =>
+        trainingDataset.unpersist()
+      } (executionContext)
+
+      // Evaluate models in a Future with thread-pool size determined by '$numParallelEval'
+      val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
+        modelFuture.flatMap { model =>
+          Future {
+            // TODO: duplicate evaluator to take extra params from input
+            val metric = eval.evaluate(model.transform(validationDataset, paramMap))
+            logDebug(s"Got metric $metric for model trained with $paramMap.")
+            metric
+          } (executionContext)
+        } (executionContext)
+      }
+
+      // Wait for metrics to be calculated before unpersisting validation dataset
+      val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
--- End diff --

Sure, not a big deal either way
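
A condensed, hedged sketch of the pattern discussed in the diff above: fit
each parameter map in a Future on a bounded thread pool and block until all
metrics are in. The names (ParallelEvalSketch, the dummy metric) are
illustrative, not from the PR:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object ParallelEvalSketch {
      def main(args: Array[String]): Unit = {
        val parallelism = 3
        val pool = Executors.newFixedThreadPool(parallelism)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

        // Stand-in for fitting/evaluating one model per ParamMap.
        val paramMaps = Seq(0.1, 0.01, 0.001)
        val metricFutures = paramMaps.map { reg =>
          Future {
            // expensive training + evaluation would happen here
            math.log(1.0 / reg) // dummy metric
          }
        }

        // Block until every metric is computed, as the PR does with awaitResult.
        val metrics = metricFutures.map(Await.result(_, Duration.Inf))
        println(metrics)
        pool.shutdown()
      }
    }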





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692150
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -220,6 +221,32 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog.leafDirPaths.head == fs.makeQualified(dirPath))
     }
   }
+
+  test("SPARK-20280 - FileStatusCache with a partition with very many files") {
+    /* fake the size, otherwise we need to allocate 2GB of data to trigger this bug */
+    class MyFileStatus extends FileStatus with KnownSizeEstimation {
+      override def estimatedSize: Long = 1000 * 1000 * 1000
+    }
+    /* files * MyFileStatus.estimatedSize should overflow to a negative integer,
+     * so make it between 2bn and 4bn
+     */
+    val files = (1 to 3).map { i =>
+      new MyFileStatus()
+    }
+    val fileStatusCache = FileStatusCache.getOrCreate(spark)
+    fileStatusCache.putLeafFiles(new Path("/tmp", "abc"), files.toArray)
+    // scalastyle:off
--- End diff --

Let's remove this comment block; the JIRA should be used for tracking these 
things.
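
As a side note, the Int overflow the test fakes is easy to reproduce on its
own; a small sketch, with values chosen to match the test's 1GB-per-status
estimate:

    object OverflowSketch {
      def main(args: Array[String]): Unit = {
        val perFile = 1000L * 1000L * 1000L   // ~1GB per fake FileStatus
        val totalAsInt = (3L * perFile).toInt // wraps to -1294967296
        val totalAsLong = 3L * perFile        // 3000000000, correct
        println(s"Int: $totalAsInt, Long: $totalAsLong")
      }
    }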





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692271
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala ---
@@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted by different clients
-  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder()
-    .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
-      override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
-        (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
-      }})
-    .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
-      override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]])
+  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = {
+    /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
--- End diff --

NIT: Could you use Java-style comments (`//`) here?





[GitHub] spark pull request #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last i...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17566





[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17330
  
**[Test build #75664 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)**
 for PR 17330 at commit 
[`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6).





[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...

2017-04-10 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/17587
  
@cloud-fan It is good to show what plan is generated in your comment. People 
(especially me) will forget in the future what plan we generated. 
When I saw the comment in `upCastToExpectedType`, I understood this PR. I 
agree with @viirya's comment.

LGTM except for this comment.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75665 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75665/testReport)**
 for PR 17591 at commit 
[`e02fc4a`](https://github.com/apache/spark/commit/e02fc4aa651f6803dc6f4844fc0570187a167e80).





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75666 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)**
 for PR 17077 at commit 
[`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47).





[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17566
  
LGTM - merging to master. Thanks!





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-04-10 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r110693397
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
         children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
       case _ => false
   }
+  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = plan.transformAllExpressions {
+      case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
--- End diff --

@cloud-fan Actually you are right. Preserving the OuterReference would be 
good.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75667 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)**
 for PR 17077 at commit 
[`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).





[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...

2017-04-10 Thread BenFradet
Github user BenFradet commented on a diff in the pull request:

https://github.com/apache/spark/pull/17308#discussion_r110685760
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import org.apache.kafka.clients.producer.KafkaProducer
+
+import org.apache.spark.internal.Logging
+
+private[kafka010] object CachedKafkaProducer extends Logging {
+
+  private val cacheMap = new mutable.HashMap[Int, KafkaProducer[Array[Byte], Array[Byte]]]()
+
+  private def createKafkaProducer(
+      producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
+    val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] =
+      new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
+    cacheMap.put(producerConfiguration.hashCode(), kafkaProducer)
--- End diff --

True, my bad. I thought `KafkaSink` was a public API.
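
For readers following along, a minimal sketch of the caching idea in the
diff above, with the producer replaced by a plain case class so the snippet
runs without Kafka. Keying on the configuration's hashCode follows the PR;
the synchronization here is an added assumption:

    import scala.collection.mutable

    case class FakeProducer(config: Map[String, String])

    object ProducerCacheSketch {
      private val cacheMap = new mutable.HashMap[Int, FakeProducer]()

      // Reuse an existing producer for an identical configuration.
      def getOrCreate(config: Map[String, String]): FakeProducer = synchronized {
        cacheMap.getOrElseUpdate(config.hashCode(), FakeProducer(config))
      }
    }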





[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...

2017-04-10 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/17593
  
I agree with @srowen; we left it that way since sorting can change





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17591
  
LGTM





[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17520
  
**[Test build #75668 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75668/testReport)**
 for PR 17520 at commit 
[`b923bd5`](https://github.com/apache/spark/commit/b923bd5b2c79a84ada834a32b756ad0da80f12c6).





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110693499
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
--- End diff --

If you think it is better, I'll trust your judgment.





[GitHub] spark pull request #17585: [SPARK-20273] [SQL] Disallow Non-deterministic Fi...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17585





[GitHub] spark issue #17364: [SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.r...

2017-04-10 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/17364
  
@squito Is this ready to go in? As I warned, I'm not going to add tests 
for this on its own.





[GitHub] spark issue #17550: [SPARK-20240][SQL] SparkSQL support limitations of max d...

2017-04-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17550
  
We do not add new features to the 1.6 branch. Please open the PR against 
the master branch.





[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...

2017-04-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17480#discussion_r110716862
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
    * yarn-client mode when AM re-registers after a failure.
    */
   def reset(): Unit = synchronized {
-    initializing = true
+    /**
+     * When some tasks need to be scheduled and initial executor = 0, resetting the
+     * initializing field may cause it to not be set to false in yarn.
+     * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+     */
+    if (maxNumExecutorsNeeded() == 0) {
+      initializing = true
--- End diff --

This kind of raises the question: is it ever correct to set this to `true` 
here?

This method is only called when the YARN client-mode AM is restarted, and 
at that point I'd expect initialization to have already happened (so I don't 
see a need to reset the field in any situation).





[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17592





[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-04-10 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/16774#discussion_r110713910
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +108,60 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val eval = $(evaluator)
     val epm = $(estimatorParamMaps)
     val numModels = epm.length
-    val metrics = new Array[Double](epm.length)
+
+    // Create execution context, run in serial if numParallelEval is 1
+    val executionContext = $(numParallelEval) match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(executorServiceFactory(n))
+    }
 
     val instr = Instrumentation.create(this, dataset)
     instr.logParams(numFolds, seed)
     logTuningParams(instr)
 
+    // Compute metrics for each model over each split
+    logDebug(s"Running cross-validation with level of parallelism: $numParallelEval.")
     val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
-    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
+    val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
       val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
-      // multi-model training
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
-      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
-      trainingDataset.unpersist()
-      var i = 0
-      while (i < numModels) {
-        // TODO: duplicate evaluator to take extra params from input
-        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
-        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
-        metrics(i) += metric
-        i += 1
+
+      // Fit models in a Future with thread-pool size determined by '$numParallelEval'
+      val models = epm.map { paramMap =>
+        Future[Model[_]] {
+          val model = est.fit(trainingDataset, paramMap)
+          model.asInstanceOf[Model[_]]
+        } (executionContext)
       }
+
+      Future.sequence[Model[_], Iterable](models)(implicitly, executionContext).onComplete { _ =>
+        trainingDataset.unpersist()
+      } (executionContext)
+
+      // Evaluate models in a Future with thread-pool size determined by '$numParallelEval'
+      val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
+        modelFuture.flatMap { model =>
+          Future {
+            // TODO: duplicate evaluator to take extra params from input
+            val metric = eval.evaluate(model.transform(validationDataset, paramMap))
+            logDebug(s"Got metric $metric for model trained with $paramMap.")
+            metric
+          } (executionContext)
+        } (executionContext)
+      }
+
+      // Wait for metrics to be calculated before unpersisting validation dataset
+      val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
--- End diff --

I thought about that, but since it's a blocking call anyway, it will still 
be bound by the longest running thread.
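
A small sketch of that point: awaiting the futures one by one and awaiting a
single Future.sequence both block for roughly the duration of the slowest
task. Names and timings here are illustrative:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
    import scala.concurrent.duration.Duration

    object AwaitSketch {
      def main(args: Array[String]): Unit = {
        implicit val ec: ExecutionContextExecutorService =
          ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))
        def work(ms: Long): Future[Long] = Future { Thread.sleep(ms); ms }

        val futures = Seq(work(100), work(300), work(200))
        // Awaiting each future in turn still takes ~300ms overall: once the
        // slowest future is done, the remaining awaits return immediately.
        val individually = futures.map(Await.result(_, Duration.Inf))
        // One await on the combined future is equivalent in wall-clock time.
        val together = Await.result(Future.sequence(futures), Duration.Inf)
        println(s"$individually == $together")
        ec.shutdown()
      }
    }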





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17592
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75662/
Test PASSed.





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17592
  
**[Test build #75662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)**
 for PR 17592 at commit 
[`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75667 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)**
 for PR 17077 at commit 
[`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75667/
Test PASSed.





[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-04-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r110714213
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -67,6 +67,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable
   if (loadDefaults) {
     loadFromSystemProperties(false)
   }
+  // tentatively enable offHeap for test
+  set("spark.memory.offHeap.enabled", "true")
+  set("spark.memory.offHeap.size", "1gb")
--- End diff --

Yes, it should be removed before merging this PR.
This code was tentatively added to enable off-heap mode.
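
A hedged sketch of the alternative: scope the flags to a configuration built
by the test instead of SparkConf's constructor. The keys are real Spark
configuration names; the object name is illustrative:

    import org.apache.spark.SparkConf

    object OffHeapConfSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf(loadDefaults = false)
          .set("spark.memory.offHeap.enabled", "true")
          .set("spark.memory.offHeap.size", "1g")
        println(conf.toDebugString)
      }
    }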





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17592
  
Merging to master. Thanks!





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17591#discussion_r110692833
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala ---
@@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
 
+
   private val warnedAboutEviction = new AtomicBoolean(false)
 
   // we use a composite cache key in order to distinguish entries inserted by different clients
-  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder()
-    .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
-      override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
-        (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
-      }})
-    .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
-      override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]])
+  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = {
+    /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
+     * instead, the weight is divided by this factor (which is smaller
+     * than the size of one [[FileStatus]]).
+     * so it will support objects up to 64GB in size.
+     */
+    val weightScale = 32
+    CacheBuilder.newBuilder()
+      .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
+        override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
+          val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale
+          if (estimate > Int.MaxValue) {
+            logWarning(s"Cached table partition metadata size is too big. Approximating to " +
+              s"${Int.MaxValue.toLong * weightScale}.")
+            Int.MaxValue
+          } else {
+            estimate.toInt
+          }
+        }
+      })
+      .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {

This is kind of hard to read. Can we just initialize the weigher and the 
listener in separate variables?
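
A sketch of what that refactor could look like, with key/value types
simplified to String for illustration; Guava on the classpath is assumed:

    import com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification, Weigher}

    object WeigherSketch {
      private val weightScale = 32

      // Weigher named separately instead of inline in the builder chain.
      private val weigher = new Weigher[String, String] {
        override def weigh(key: String, value: String): Int = {
          val estimate = (key.length.toLong + value.length.toLong) / weightScale
          math.min(estimate, Int.MaxValue.toLong).toInt
        }
      }

      private val listener = new RemovalListener[String, String] {
        override def onRemoval(n: RemovalNotification[String, String]): Unit =
          println(s"evicted ${n.getKey} (${n.getCause})")
      }

      val cache: Cache[String, String] = CacheBuilder.newBuilder()
        .maximumWeight(1024L)
        .weigher(weigher)
        .removalListener(listener)
        .build[String, String]()
    }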





[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-04-10 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r110692936
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
+            2
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort
+        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with a list of columns
+        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
+            2
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort with a list of columns
+        (df.write.bucketBy(2, "x")
+            .sortBy(["y", "z"])
+            .mode("overwrite").saveAsTable("pyspark_bucket"))
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write with bucket and sort with multiple columns
+        (df.write.bucketBy(2, "x")
+            .sortBy("y", "z")
+            .mode("overwrite").saveAsTable("pyspark_bucket"))
--- End diff --

I don't think that dropping the table before is necessary. We overwrite on 
each write and name clashes are unlikely.

We can drop the table after the tests, but I am not sure how to do it right. 
`SQLTests` is overgrown and I am not sure if we should add `tearDown` only 
for this, but adding `DROP TABLE` in the test itself doesn't look right.





[GitHub] spark issue #12823: [SPARK-14985][ML] Update LinearRegression, LogisticRegre...

2017-04-10 Thread BenFradet
Github user BenFradet commented on the issue:

https://github.com/apache/spark/pull/12823
  
ping @jkbradley if you could take a look, that'd be great.

If you have the time, there is also the #17431 segue.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17077
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75666/
Test FAILed.





[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17077
  
**[Test build #75666 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)**
 for PR 17077 at commit 
[`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-04-10 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/16774#discussion_r110712166
  
--- Diff: docs/ml-tuning.md ---
@@ -55,6 +55,9 @@ for multiclass problems. The default metric used to choose the best `ParamMap` c
 method in each of these evaluators.
 
 To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
+Sets of parameters from the parameter grid can be evaluated in parallel by setting `numParallelEval` with a value of 2 or more (a value of 1 will evaluate in serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
+The value of `numParallelEval` should be chosen carefully to maximize parallelism without exceeding cluster resources, and will be capped at the number of cores in the driver system. Generally speaking, a value up to 10 should be sufficient for most clusters.
+
--- End diff --

Since that API is marked as experimental, maybe it would be better not to 
document it right away until we are sure this is what we need?
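
If it is documented, a hedged sketch of usage under the PR's proposed API;
note that `setNumParallelEval` comes from this PR (SPARK-19357) and is not
in any released Spark version, so the final name may differ:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01, 0.001))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
      .setNumParallelEval(3) // assumed setter from this PR; capped at driver cores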





[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...

2017-04-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17592
  
Should this go into branch-2.1 as well?






[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17330
  
**[Test build #75664 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)**
 for PR 17330 at commit 
[`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17595
  
**[Test build #75671 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75671/testReport)**
 for PR 17595 at commit 
[`263dc68`](https://github.com/apache/spark/commit/263dc688a1da8cfeef9f9e7e0168964a27e779b6).





[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

2017-04-10 Thread umehrot2
Github user umehrot2 commented on the issue:

https://github.com/apache/spark/pull/17445
  
@kayousterhout @mridulm @rxin @lins05 Can you take a look at this PR ?





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17597
  
**[Test build #75673 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75673/testReport)**
 for PR 17597 at commit 
[`67a1221`](https://github.com/apache/spark/commit/67a12213b8abbb9c41333ff9e0ffb3f70944a12e).





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17527
  
Merged to master





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/17594
  
LGTM, for fixing the issue with the test.  We should separately decide if 
this is really the behavior we want for the commit log.





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17596
  
**[Test build #75672 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75672/testReport)**
 for PR 17596 at commit 
[`67044b3`](https://github.com/apache/spark/commit/67044b34ac60863d3fbdbe37ba89fffa85ffb2fe).





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17596
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75672/
Test FAILed.





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17596
  
**[Test build #75672 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75672/testReport)**
 for PR 17596 at commit 
[`67044b3`](https://github.com/apache/spark/commit/67044b34ac60863d3fbdbe37ba89fffa85ffb2fe).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17596
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17597
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17597
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75673/
Test PASSed.





[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17595
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75671/
Test FAILed.





[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17595
  
**[Test build #75671 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75671/testReport)**
 for PR 17595 at commit 
[`263dc68`](https://github.com/apache/spark/commit/263dc688a1da8cfeef9f9e7e0168964a27e779b6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17595
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...

2017-04-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17495#discussion_r110722790
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -320,14 +321,15 @@ private[history] class FsHistoryProvider(conf: 
SparkConf, clock: Clock)
 .filter { entry =>
   try {
 val prevFileSize = 
fileToAppInfo.get(entry.getPath()).map{_.fileSize}.getOrElse(0L)
+fs.access(entry.getPath, FsAction.READ)
--- End diff --

So, this API actually calls `fs.getFileStatus()` which is a remote call and 
completely unnecessary in this context, since you already have the `FileStatus`.

You could instead do the check directly against the `FsPermission` object 
(the same way `fs.access()` does after it performs the remote call).
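
A minimal sketch of that check, assuming the standard Hadoop 
`FileStatus`/`FsPermission`/`UserGroupInformation` APIs; the `canRead` helper 
name is illustrative, not part of the patch:

```
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.UserGroupInformation

// Check READ access against the permission bits already present in the
// FileStatus, avoiding the extra remote getFileStatus() that fs.access() makes.
def canRead(status: FileStatus): Boolean = {
  val ugi = UserGroupInformation.getCurrentUser
  val perm = status.getPermission
  if (status.getOwner == ugi.getShortUserName) {
    perm.getUserAction.implies(FsAction.READ)
  } else if (ugi.getGroupNames.contains(status.getGroup)) {
    perm.getGroupAction.implies(FsAction.READ)
  } else {
    perm.getOtherAction.implies(FsAction.READ)
  }
}
```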





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17591
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75665/
Test PASSed.





[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17591
  
**[Test build #75665 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75665/testReport)**
 for PR 17591 at commit 
[`e02fc4a`](https://github.com/apache/spark/commit/e02fc4aa651f6803dc6f4844fc0570187a167e80).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17594: Write the log first to fix a race condition in test...

2017-04-10 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/17594

Write the log first to fix a race condition in tests

## What changes were proposed in this pull request?

This PR fixes the following failure:
```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
Assert on query failed: 

== Progress ==
   AssertOnQuery(, )
   StopStream
   AddData to MemoryStream[value#30891]: 1,2
   
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
   CheckAnswer: [6],[3]
   StopStream
=> AssertOnQuery(, )
   AssertOnQuery(, )
   
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
   CheckAnswer: [6],[3]
   StopStream
   AddData to MemoryStream[value#30891]: 3
   
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
   CheckLastBatch: [2]
   StopStream
   AddData to MemoryStream[value#30891]: 0
   
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
   ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
   AssertOnQuery(, )
   AssertOnQuery(, incorrect start offset or end offset on 
exception)

== Stream ==
Output Mode: Append
Stream state: not started
Thread state: dead


== Sink ==
0: [6] [3]


== Plan ==

 
 
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
at 
org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
at 
org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
at 
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
at 
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
at 
org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
```

[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-10 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110741246
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
--- End diff --

@cloud-fan The ```outer/inner``` terms represent the join plan combinations 
generated by ```JoinReorderDP```. ```JoinReorderDP``` calls them 
```oneJoinPlan/otherJoinPlan```. I will rename them to align with the join DP 
naming.





[GitHub] spark issue #17598: [SPARK-20284][CORE] Make {Des,S}erializationStream exten...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17598
  
Can one of the admins verify this patch?





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread brkyvz
Github user brkyvz commented on the issue:

https://github.com/apache/spark/pull/17597
  
LGTM!





[GitHub] spark pull request #17597: [SPARK-20285][Tests]Increase the pyspark streamin...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17597





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17594
  
Thanks! Merging to master.





[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...

2017-04-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17594#discussion_r110735241
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -304,8 +304,8 @@ class StreamExecution(
   finishTrigger(dataAvailable)
   if (dataAvailable) {
 // Update committed offsets.
-committedOffsets ++= availableOffsets
 batchCommitLog.add(currentBatchId)
--- End diff --

This is existing behavior, but do we not write to the commit log if there is no 
new data in the next batch?





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-10 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110742657
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -218,28 +220,44 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
   }
 
   /**
-   * Builds a new JoinPlan when both conditions hold:
+   * Builds a new JoinPlan if the following conditions hold:
* - the sets of items contained in left and right sides do not overlap.
* - there exists at least one join condition involving references from 
both sides.
+   * - if star-join filter is enabled, allow the following combinations:
+   * 1) (oneJoinPlan U otherJoinPlan) is a subset of star-join
+   * 2) star-join is a subset of (oneJoinPlan U otherJoinPlan)
+   * 3) (oneJoinPlan U otherJoinPlan) is a subset of non star-join
+   *
* @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
* @param otherJoinPlan The other side JoinPlan for building a new join 
node.
* @param conf SQLConf for statistics computation.
* @param conditions The overall set of join conditions.
* @param topOutput The output attributes of the final plan.
+   * @param filters Join graph info to be used as filters by the search 
algorithm.
* @return Builds and returns a new JoinPlan if both conditions hold. 
Otherwise, returns None.
*/
   private def buildJoin(
   oneJoinPlan: JoinPlan,
   otherJoinPlan: JoinPlan,
   conf: SQLConf,
   conditions: Set[Expression],
-  topOutput: AttributeSet): Option[JoinPlan] = {
+  topOutput: AttributeSet,
+  filters: Option[JoinGraphInfo]): Option[JoinPlan] = {
 
 if (oneJoinPlan.itemIds.intersect(otherJoinPlan.itemIds).nonEmpty) {
   // Should not join two overlapping item sets.
   return None
 }
 
+if (conf.joinReorderDPStarFilter && filters.isDefined) {
+  // Apply star-join filter, which ensures that tables in a star 
schema relationship
+  // are planned together.
+  val isValidJoinCombination =
+JoinReorderDPFilters(conf).starJoinFilter(oneJoinPlan.itemIds, 
otherJoinPlan.itemIds,
+  filters.get)
+  if (!isValidJoinCombination) return None
--- End diff --

@cloud-fan The star filter will eliminate joins among star and non-star 
tables until the star-joins are built. The assumption is that star-joins should 
be planned together. I will add more comments.





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17594
  
**[Test build #75670 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75670/testReport)**
 for PR 17594 at commit 
[`6b63179`](https://github.com/apache/spark/commit/6b631799033f708ae61338a3fc8e5d63255fa803).





[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17596
  
cc @rxin @davies @andrewor14 





[GitHub] spark pull request #17596: [SPARK-12837][SQL] reduce the serialized size of ...

2017-04-10 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/17596

[SPARK-12837][SQL] reduce the serialized size of accumulator

## What changes were proposed in this pull request?

When sending accumulator updates back to the driver, the network overhead is 
pretty big, as there are a lot of accumulators: e.g. `TaskMetrics` sends about 
20 accumulators every time, and there may be a lot of `SQLMetric`s if the query 
plan is complicated.

Therefore, it's critical to reduce the size of the serialized accumulators. This 
PR proposes two ways:

1. Do not send the accumulator name to the executor side, as it's unnecessary: 
when an executor sends accumulator updates back to the driver, we can look up 
the accumulator name in `AccumulatorContext` easily (see the sketch below).
2. Introduce `InternalLongAccumulator`, which slightly reduces the size and also 
takes over the `setValue` functionality that doesn't fit well in 
`LongAccumulator`.

Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, 
the size of serialized accumulator has been cut down around 50%.
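
A minimal sketch of idea (1), assuming access to the `private[spark]` 
`AccumulatorContext` registry; the `nameForUpdate` helper is illustrative, not 
part of this patch:

```
import org.apache.spark.util.{AccumulatorContext, AccumulatorV2}

// On the driver, an update that arrived from an executor without a name can
// be matched to the registered accumulator by its id, so the name never has
// to be serialized and shipped to executors in the first place.
def nameForUpdate(update: AccumulatorV2[_, _]): Option[String] =
  AccumulatorContext.get(update.id).flatMap(_.name)
```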

## How was this patch tested?

existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark oom

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17596.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17596


commit 67044b34ac60863d3fbdbe37ba89fffa85ffb2fe
Author: Wenchen Fan 
Date:   2017-04-10T12:31:47Z

reduce the serialized size of accumulator







[GitHub] spark pull request #17598: [SPARK-20284][CORE] Make {Des,S}erializationStrea...

2017-04-10 Thread superbobry
GitHub user superbobry opened a pull request:

https://github.com/apache/spark/pull/17598

[SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable

## What changes were proposed in this pull request?

This PR makes it possible to use `SerializationStream` and 
`DeserializationStream` in try-with-resources blocks.

## How was this patch tested?

`core` unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark 
compression-stream-closeable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17598.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17598


commit 75ba026db26171e0ed59d48d0ab2855f2a2af757
Author: Sergei Lebedev 
Date:   2017-04-10T19:32:12Z

[SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable

This change allows to use these streams in try-with-resources.
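
For Scala callers, extending `Closeable` also makes the streams usable with a 
generic loan-pattern helper; a minimal self-contained sketch (the `withResource` 
helper is illustrative, not part of Spark):

```
import java.io.Closeable

// Loan pattern over any Closeable: the resource is closed even if f throws,
// mirroring what try-with-resources gives Java callers.
def withResource[R <: Closeable, T](resource: R)(f: R => T): T =
  try f(resource) finally resource.close()
```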







[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17591
  
Merging to master. Thanks!





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-10 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110740755
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -54,8 +54,6 @@ case class CostBasedJoinReorder(conf: SQLConf) extends 
Rule[LogicalPlan] with Pr
 
   private def reorder(plan: LogicalPlan, output: Seq[Attribute]): 
LogicalPlan = {
 val (items, conditions) = extractInnerJoins(plan)
-// TODO: Compute the set of star-joins and use them in the join 
enumeration
-// algorithm to prune un-optimal plan choices.
--- End diff --

@cloud-fan Star-schema detection is first called to compute the set of 
tables connected by a star-schema relationship, e.g. {F1, D1, D2} in our code 
example. This call does not do any join reordering among the tables. It simply 
computes the set of tables in a star-schema relationship. Then, DP join 
enumeration generates all possible plan combinations among the entire set of 
tables in the join, e.g. {F1, D1}, {F1, T1}, {T2, T3}, etc. Star-filter, if 
called, will eliminate plan combinations among the star and non-star tables 
until the star-join combinations are built. For example, the {F1, D1} 
combination will be retained since it involves tables in a star schema, but 
{F1, T1} will be eliminated since it mixes star and non-star tables. 
Star-filter simply decides what combinations to retain, but it will not decide 
on the order of execution of those tables. The order of the joins within a 
star-join, and for the overall plan, is decided by the DP join enumeration. 
Star-filter only ensures that tables in a star-join are planned together.
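
A minimal sketch of the allowed combinations over sets of plan ids, assuming 
the star and non-star sets are disjoint (illustrative, not the actual 
`JoinReorderDPFilters` code):

```
// Keep a join of oneSide with otherSide only if the union stays entirely
// inside the star set, already covers the whole star set, or stays
// entirely inside the non-star set.
def isAllowedCombination(
    oneSide: Set[Int], otherSide: Set[Int],
    star: Set[Int], nonStar: Set[Int]): Boolean = {
  val union = oneSide ++ otherSide
  union.subsetOf(star) || star.subsetOf(union) || union.subsetOf(nonStar)
}
```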






[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17520
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75668/
Test PASSed.





[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17520
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17568
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75669/
Test PASSed.





[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17568
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...

2017-04-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17495#discussion_r110721933
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 ---
@@ -571,6 +572,34 @@ class FsHistoryProviderSuite extends SparkFunSuite 
with BeforeAndAfter with Matc
 }
  }
 
+  test("log without read permission should be filtered out before actual 
reading") {
+class TestFsHistoryProvider extends 
FsHistoryProvider(createTestConf()) {
--- End diff --

I think you'll need this from another test in this same file:

```
// setReadable(...) does not work on Windows. Please refer JDK-6728842.
assume(!Utils.isWindows)
```

In fact shouldn't this test be merged with the test for SPARK-3697?





[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17330
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17330
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75664/
Test PASSed.





[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...

2017-04-10 Thread ioana-delaney
Github user ioana-delaney commented on a diff in the pull request:

https://github.com/apache/spark/pull/17546#discussion_r110742361
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with 
Logging {
 case class Cost(card: BigInt, size: BigInt) {
   def +(other: Cost): Cost = Cost(this.card + other.card, this.size + 
other.size)
 }
+
+/**
+ * Implements optional filters to reduce the search space for join 
enumeration.
+ *
+ * 1) Star-join filters: Plan star-joins together since they are assumed
+ *to have an optimal execution based on their RI relationship.
+ * 2) Cartesian products: Defer their planning later in the graph to avoid
+ *large intermediate results (expanding joins, in general).
+ * 3) Composite inners: Don't generate "bushy tree" plans to avoid 
materializing
+ *   intermediate results.
+ *
+ * Filters (2) and (3) are not implemented.
+ */
+case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper {
+  /**
+   * Builds join graph information to be used by the filtering strategies.
+   * Currently, it builds the sets of star/non-star joins.
+   * It can be extended with the sets of connected/unconnected joins, which
+   * can be used to filter Cartesian products.
+   */
+  def buildJoinGraphInfo(
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = {
+
+// Compute the tables in a star-schema relationship.
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
+val nonStarJoin = items.filterNot(starJoin.contains(_))
+
+if (starJoin.nonEmpty && nonStarJoin.nonEmpty) {
+  val (starInt, nonStarInt) = planIndex.collect {
+case (p, i) if starJoin.contains(p) =>
+  (Some(i), None)
+case (p, i) if nonStarJoin.contains(p) =>
+  (None, Some(i))
+case _ =>
+  (None, None)
+  }.unzip
+  Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet))
+} else {
+  // Nothing interesting to return.
+  None
+}
+  }
+
+  /**
+   * Applies star-join filter.
+   *
+   * Given the outer/inner and the star/non-star sets,
+   * the following plan combinations are allowed:
+   * 1) (outer U inner) is a subset of star-join
+   * 2) star-join is a subset of (outer U inner)
+   * 3) (outer U inner) is a subset of non star-join
+   *
+   * It assumes the sets are disjoint.
+   *
+   * Example query graph:
+   *
+   * t1   d1 - t2 - t3
+   *  \  /
+   *   f1
+   *   |
+   *   d2
+   *
+   * star: {d1, f1, d2}
+   * non-star: {t2, t1, t3}
--- End diff --

@cloud-fan ```outer/inner``` i.e. ```oneJoinPlan/otherJoinPlan``` represent 
all the plan permutations generated by ```JoinReorderDP```. For example, at 
level 1, join enumeration will combine plans from level 0 e.g. ```oneJoinPlan = 
(f1)``` and ```otherJoinPlan = (d1)```. At level 2, it will generate plans from 
plan combinations at level 0 and level 1 e.g. ```oneJoinPlan = (d2)``` and 
```otherJoinPlan = {f1, d1}```, and so on. I will clarify the comment with more 
details.





[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-10 Thread nihavend
Github user nihavend commented on the issue:

https://github.com/apache/spark/pull/17527
  
Thank you very much, all of you, for your efforts. I sometimes face the 
same issue on different platforms and look for a way to set the JVM locale 
options explicitly, but many times there is no way. 

And mostly the platform owners do not treat this as a serious problem or take 
care of it the way you do.

For this reason, leaving users an open door on these platforms to define the 
locale manually may solve most of the locale-related issues.

Hope this helps some for the future.





[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL

2017-04-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17566
  
Thank you @hvanhovell.





[GitHub] spark pull request #17597: [SPARK-20285][Tests]Increase the pyspark streamin...

2017-04-10 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/17597

[SPARK-20285][Tests]Increase the pyspark streaming test timeout to 30 
seconds

## What changes were proposed in this pull request?

Saw the following failure locally:

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 
351, in test_cogroup
self._test_func(input, func, expected, sort=True, input2=input2)
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 
162, in _test_func
self.assertEqual(expected, result)
AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []

First list contains 3 additional elements.
First extra element 0:
[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]

+ []
- [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
-  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
-  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
```

It also happened on Jenkins: 
http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120

It happens because the timeout is not enough when the machine is overloaded. 
This PR just increases the timeout to 30 seconds.

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-20285

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17597


commit 67a12213b8abbb9c41333ff9e0ffb3f70944a12e
Author: Shixiong Zhu 
Date:   2017-04-10T19:47:21Z

Increase the pyspark streaming test timeout to 30 seconds







[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/17595
  
LGTM - pending jenkins.





[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17591





[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17520
  
**[Test build #75668 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75668/testReport)**
 for PR 17520 at commit 
[`b923bd5`](https://github.com/apache/spark/commit/b923bd5b2c79a84ada834a32b756ad0da80f12c6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerBlockManagerAdded(`
  * `class StorageStatus(`
  * `public final class JavaStructuredSessionization `
  * `  public static class LineWithTimestamp implements Serializable `
  * `  public static class Event implements Serializable `
  * `  public static class SessionInfo implements Serializable `
  * `  public static class SessionUpdate implements Serializable `
  * `case class Event(sessionId: String, timestamp: Timestamp)`
  * `case class SessionInfo(`
  * `case class SessionUpdate(`
  * `class Correlation(object):`
  * `case class UnresolvedMapObjects(`
  * `case class AssertNotNull(child: Expression, walkedTypePath: 
Seq[String] = Nil)`
  * `case class StarSchemaDetection(conf: SQLConf) extends PredicateHelper `
  * `   * Helper case class to hold (plan, rowCount) pairs.`





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17597
  
Thanks! Merging to master, 2.1 and 2.0.





[GitHub] spark issue #17582: [SPARK-20239][Core] Improve HistoryServer's ACL mechanis...

2017-04-10 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/17582
  
> user configured with "spark.admin.acls" (or group) or 
"spark.ui.view.acls" (or group), or the user who started SHS could list all the 
applications, otherwise none of them can be listed

So to me this is the only bug, which means that maybe ACLs on the listing 
itself shouldn't ever be applied, and this PR should be a lot simpler, right? 

Most of it seems to be dealing with filtering the list of apps so that only 
applications the user can see are shown. I wonder if that's necessary, since 
the only thing that's shown is the existence of the application, not any data 
about it that could be considered sensitive.

There's also a minor thing that the listing being different for different 
users might cause confusion; but if there's a good reason for filtering, then 
that concern can be overridden. I'm just not sure there is a good reason for it.





[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17568
  
**[Test build #75669 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75669/testReport)**
 for PR 17568 at commit 
[`3be368c`](https://github.com/apache/spark/commit/3be368c84f82f65baf475805b3f56441cece3128).





[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...

2017-04-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17594#discussion_r110735710
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -304,8 +304,8 @@ class StreamExecution(
   finishTrigger(dataAvailable)
   if (dataAvailable) {
 // Update committed offsets.
-committedOffsets ++= availableOffsets
 batchCommitLog.add(currentBatchId)
--- End diff --

I think so. I think it's because we only update `currentBatchId` when data 
is available. cc @tdas





[GitHub] spark pull request #17595: [SPARK-20283][SQL] Add preOptimizationBatches

2017-04-10 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17595

[SPARK-20283][SQL] Add preOptimizationBatches

## What changes were proposed in this pull request?
We currently have postHocOptimizationBatches, but not 
preOptimizationBatches. This patch adds preOptimizationBatches so the optimizer 
debugging extensions are symmetric.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20283

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17595


commit 263dc688a1da8cfeef9f9e7e0168964a27e779b6
Author: Reynold Xin 
Date:   2017-04-10T19:00:24Z

[SPARK-20283][SQL] Add preOptimizationBatches







[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16347
  
is this still a problem?





[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17527





[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17597
  
**[Test build #75673 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75673/testReport)**
 for PR 17597 at commit 
[`67a1221`](https://github.com/apache/spark/commit/67a12213b8abbb9c41333ff9e0ffb3f70944a12e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17546
  
**[Test build #75674 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75674/testReport)**
 for PR 17546 at commit 
[`7a5d1d0`](https://github.com/apache/spark/commit/7a5d1d0615a98fbee3c58934f92314bb92a97354).





[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...

2017-04-10 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17594#discussion_r110760826
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -304,8 +304,8 @@ class StreamExecution(
           finishTrigger(dataAvailable)
           if (dataAvailable) {
             // Update committed offsets.
-            committedOffsets ++= availableOffsets
             batchCommitLog.add(currentBatchId)
--- End diff --

Technically, when there is a batch with data, it finishes executing and then increments the batch id for the next batch. But if the next batch has no data, the batch id is not incremented further, and it is eventually reused by a future batch that does have data.

So I think it is correct to update the commit log as soon as this batch is done, and before the batch id is incremented for the next batch. This change maintains that invariant, as far as I can tell.
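
To make that ordering concrete, here is a minimal, self-contained sketch of the invariant (the names `CommitOrderSketch`, `finishBatch`, `commitLog`, and `committedOffsets` are illustrative stand-ins, not the actual `StreamExecution` members):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified model of the commit ordering described above.
object CommitOrderSketch {
  var currentBatchId: Long = 0L
  val commitLog = ArrayBuffer.empty[Long]          // stand-in for batchCommitLog
  var committedOffsets = Map.empty[String, Long]   // stand-in for committedOffsets

  def finishBatch(dataAvailable: Boolean, availableOffsets: Map[String, Long]): Unit = {
    if (dataAvailable) {
      // 1. Record this batch in the commit log first, so recovery after a
      //    failure at any later point can see that the batch completed.
      commitLog += currentBatchId
      // 2. Only then mark the processed offsets as committed.
      committedOffsets ++= availableOffsets
      // 3. The id advances only after a batch with data commits; an empty
      //    trigger leaves currentBatchId unchanged, so the same id is
      //    reused by the next batch that does have data.
      currentBatchId += 1
    }
  }

  def main(args: Array[String]): Unit = {
    finishBatch(dataAvailable = true, Map("source-0" -> 10L))
    finishBatch(dataAvailable = false, Map.empty)   // empty trigger: no increment
    finishBatch(dataAvailable = true, Map("source-0" -> 20L))
    println(commitLog)        // ArrayBuffer(0, 1)
    println(currentBatchId)   // 2
  }
}
```

Running the sketch commits batch ids 0 and 1, while the empty trigger in between leaves the id untouched, which is the reuse behavior described above.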





[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17568
  
**[Test build #75669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75669/testReport)** for PR 17568 at commit [`3be368c`](https://github.com/apache/spark/commit/3be368c84f82f65baf475805b3f56441cece3128).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17594
  
**[Test build #75670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75670/testReport)** for PR 17594 at commit [`6b63179`](https://github.com/apache/spark/commit/6b631799033f708ae61338a3fc8e5d63255fa803).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17594
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75670/
Test PASSed.





[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...

2017-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17594
  
Merged build finished. Test PASSed.




