[GitHub] spark pull request #16641: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz closed the pull request at:

https://github.com/apache/spark/pull/16641


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16641: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
GitHub user someorz opened a pull request:

https://github.com/apache/spark/pull/16641

Merge pull request #1 from apache/master

update

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/someorz/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16641.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16641


commit 65c6538a24b42fd4de934623553155ada76125e7
Author: someorz <24164...@qq.com>
Date:   2016-10-17T07:52:29Z

Merge pull request #1 from apache/master

update




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16344
  
**[Test build #71643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71643/testReport)**
 for PR 16344 at commit 
[`83deee3`](https://github.com/apache/spark/commit/83deee352c46ec113554fccee4bdc14ead56072e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz closed the pull request at:

https://github.com/apache/spark/pull/16640


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz commented on the issue:

https://github.com/apache/spark/pull/16640
  
update


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
GitHub user someorz opened a pull request:

https://github.com/apache/spark/pull/16640

Merge pull request #1 from apache/master

update

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/someorz/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16640


commit 65c6538a24b42fd4de934623553155ada76125e7
Author: someorz <24164...@qq.com>
Date:   2016-10-17T07:52:29Z

Merge pull request #1 from apache/master

update




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16593
  
@windpiger Could you do me a favor to add a dedicated test case in this PR? 
- Create a partitinoed Hive Table
- Create a partitinoed data source Table
- Create a partitinoed Hive Table As SELECT
- Create a partitinoed data source Table AS SELECT

I want to see whether all of them follow the same rule:
- data columns + partitino columns
- the order of data columns is based on the user-specified order in either 
schema (CT) or query (CTAS)
- the order of parttin columns is based on the order of columns specified 
in the clause of `PARTITIONED BY`?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/16344
  
jenkins test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/16344
  
jenkins add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16593#discussion_r96805975
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -87,8 +101,8 @@ case class CreateHiveTableAsSelectCommand(
   }
 } else {
   try {
-sparkSession.sessionState.executePlan(InsertIntoTable(
-  metastoreRelation, Map(), query, overwrite = true, ifNotExists = 
false)).toRdd
+
sparkSession.sessionState.executePlan(InsertIntoTable(metastoreRelation,
--- End diff --

Yeah. Agree


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16593#discussion_r96805893
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -64,7 +77,7 @@ case class CreateHiveTableAsSelectCommand(
   val withSchema = if (withFormat.schema.isEmpty) {
 // Hive doesn't support specifying the column list for target 
table in CTAS
 // However we don't think SparkSQL should follow that.
--- End diff --

We need to update the above comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15415
  
**[Test build #71642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71642/testReport)**
 for PR 15415 at commit 
[`3273b76`](https://github.com/apache/spark/commit/3273b76c3d818636a822f98ecd3df0706a4cae26).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12064: [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSu...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12064
  
**[Test build #71641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71641/testReport)**
 for PR 12064 at commit 
[`cbec946`](https://github.com/apache/spark/commit/cbec946583536283bf31dd5fb4f61b724e502e68).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r96804046
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, 
FPGrowthModel => MLlibFPGrowthModel}
+import org.apache.spark.sql.{DataFrame, _}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
+
+/**
+ * Common params for FPGrowth and FPGrowthModel
+ */
+private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with 
HasPredictionCol {
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new 
ArrayType(StringType, false))
+SchemaUtils.appendColumn(schema, $(predictionCol), new 
ArrayType(StringType, false))
+  }
+
+  /**
+   * the minimal support level of the frequent pattern
+   * Default: 0.3
+   * @group param
+   */
+  @Since("2.2.0")
+  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
+"the minimal support level of the frequent pattern (Default: 0.3)")
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinSupport: Double = $(minSupport)
+
+}
+
+/**
+ * :: Experimental ::
+ * A parallel FP-growth algorithm to mine frequent itemsets.
+ *
+ * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: 
Parallel FP-Growth for Query
+ *  Recommendation]]
+ */
+@Since("2.2.0")
+@Experimental
+class FPGrowth @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Estimator[FPGrowthModel] with FPGrowthParams with 
DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("FPGrowth"))
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setMinSupport(value: Double): this.type = set(minSupport, value)
+  setDefault(minSupport -> 0.3)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  def fit(dataset: Dataset[_]): FPGrowthModel = {
+val data = dataset.select($(featuresCol)).rdd.map(r => 
r.getSeq[String](0).toArray)
+val parentModel = new 
MLlibFPGrowth().setMinSupport($(minSupport)).run(data)
+copyValues(new FPGrowthModel(uid, parentModel))
+  }
+
+  @Since("2.2.0")
+  override def transformSchema(schema: StructType): StructType = {
+validateAndTransformSchema(schema)
+  }
+
+  override def copy(extra: ParamMap): FPGrowth = defaultCopy(extra)
+}
+
+
+@Since("2.2.0")
+object FPGrowth extends DefaultParamsReadable[FPGrowth] {
+
+  @Since("2.2.0")
+  override def load(path: String): FPGrowth = super.load(path)
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by FPGrowth.
+ *
+ * @param parentModel a model trained by spark.mllib.fpm.FPGrowth
+ */
+@Since("2.2.0")
+@Experimental
+class FPGrowthModel private[ml] (
+@Since("2.2.0") override val uid: String,
+private val parentModel: MLlibFPGrowthModel[_])
+  extends Model[FPGrowthModel] with FPGrowthParams with 

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r96803812
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, 
FPGrowthModel => MLlibFPGrowthModel}
+import org.apache.spark.sql.{DataFrame, _}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
+
+/**
+ * Common params for FPGrowth and FPGrowthModel
+ */
+private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with 
HasPredictionCol {
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new 
ArrayType(StringType, false))
+SchemaUtils.appendColumn(schema, $(predictionCol), new 
ArrayType(StringType, false))
+  }
+
+  /**
+   * the minimal support level of the frequent pattern
+   * Default: 0.3
+   * @group param
+   */
+  @Since("2.2.0")
+  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
+"the minimal support level of the frequent pattern (Default: 0.3)")
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinSupport: Double = $(minSupport)
+
+}
+
+/**
+ * :: Experimental ::
+ * A parallel FP-growth algorithm to mine frequent itemsets.
+ *
+ * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: 
Parallel FP-Growth for Query
+ *  Recommendation]]
+ */
+@Since("2.2.0")
+@Experimental
+class FPGrowth @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Estimator[FPGrowthModel] with FPGrowthParams with 
DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("FPGrowth"))
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setMinSupport(value: Double): this.type = set(minSupport, value)
+  setDefault(minSupport -> 0.3)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
--- End diff --

Thanks. Let's collect more feedback about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71640/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71640 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)**
 for PR 16639 at commit 
[`9635980`](https://github.com/apache/spark/commit/9635980fca20d18b44fa5085996ad43cbf3f3bb5).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
need define a new map output statistics  to do this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf I don't think it would work. map output statistics is just 
approximate number of output bytes. You can't use it to get correct row number.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71638 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)**
 for PR 16639 at commit 
[`b93c37f`](https://github.com/apache/spark/commit/b93c37f1b0dfd3f1293b7a3df8beacc2fec7a33d).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71638/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r96802011
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala ---
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.fpm.{AssociationRules => 
MLlibAssociationRules}
+import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
+import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
+
+/**
+ * :: Experimental ::
+ *
+ * Generates association rules from frequent itemsets ("items", "freq"). 
This method only generates
+ * association rules which have a single item as the consequent.
+ */
+@Since("2.1.0")
+@Experimental
+class AssociationRules(override val uid: String) extends Params {
--- End diff --

`freqItemsets` and `rules` does not have a one-to-one mapping relation and 
will probably violates the primitives of Transformer. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)**
 for PR 16639 at commit 
[`9635980`](https://github.com/apache/spark/commit/9635980fca20d18b44fa5085996ad43cbf3f3bb5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)**
 for PR 16639 at commit 
[`5c28b62`](https://github.com/apache/spark/commit/5c28b62941fc08438e16a92bc70041636cc1dbee).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71637/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/16639
  
cc @kayousterhout @markhamstra @mateiz 

This isn't just protecting against crazy user code -- I've seen users hit 
this with spark sql (because of 
https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214),
 so it seems important to fix.

I attempted to write a larger integration test, which reproduced the issue 
in a "local-cluster" setup, but got stuck.  ShuffleBlockFetcherIterator does 
_some_ fetches on construction, before its used as an iterator wrapped in user 
code.  So if the failures happen during that initialization, everything was 
fine before.  The failure has to happen inside the call to 
`shuffleBlockFetcherIterator.next()` when its called by the user's iterator for 
the error to happen.  I eventually was able to reproduce it with this 
https://github.com/squito/spark/commit/c2d27d10f32edf70e78d849967f7b7bf51495c4e 
but it involved hacking internals and didn't seem easy to get into a test.  I 
settled for a simpler unit test just on `Executor`, but open to more 
suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16621#discussion_r96800976
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -586,12 +594,12 @@ class SessionCatalog(
 desc = metadata,
 output = metadata.schema.toAttributes,
 child = parser.parsePlan(viewText))
-  SubqueryAlias(relationAlias, child, Option(name))
+  SubqueryAlias(relationAlias, child, Some(name.copy(table = 
table, database = Some(db
 } else {
   SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), 
None)
 }
   } else {
-SubqueryAlias(relationAlias, tempTables(table), Option(name))
+SubqueryAlias(relationAlias, tempTables(table), None)
--- End diff --

Sorry, I have been living under a rock in the past month or so.

This is not really needed anymore. Lets remove it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16552
  
**[Test build #71639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71639/testReport)**
 for PR 16552 at commit 
[`2bf67c7`](https://github.com/apache/spark/commit/2bf67c722b4d57f446b54fa4f35349ba7cb2b6d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16639
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71636/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)**
 for PR 16639 at commit 
[`b93c37f`](https://github.com/apache/spark/commit/b93c37f1b0dfd3f1293b7a3df8beacc2fec7a33d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16621#discussion_r96800456
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -586,12 +594,12 @@ class SessionCatalog(
 desc = metadata,
 output = metadata.schema.toAttributes,
 child = parser.parsePlan(viewText))
-  SubqueryAlias(relationAlias, child, Option(name))
+  SubqueryAlias(relationAlias, child, Some(name.copy(table = 
table, database = Some(db
 } else {
   SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), 
None)
 }
   } else {
-SubqueryAlias(relationAlias, tempTables(table), Option(name))
+SubqueryAlias(relationAlias, tempTables(table), None)
--- End diff --

ping @hvanhovell again. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16581: [SPARK-18589] [SQL] Fix Python UDF accessing attributes ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16581
  
**[Test build #3541 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3541/testReport)**
 for PR 16581 at commit 
[`e4db820`](https://github.com/apache/spark/commit/e4db8209843379fdd385dbf299baca7dea410075).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)**
 for PR 16639 at commit 
[`5c28b62`](https://github.com/apache/spark/commit/5c28b62941fc08438e16a92bc70041636cc1dbee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
Yes, you are right, we can not ensure the uniform distribution for global 
limit.
An idea is not use a special partitioner, after the shuffle we should get 
the mapoutput statistics for row num of each bucket, and decide each global 
limit should take how many element.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16639
  
**[Test build #71636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71636/testReport)**
 for PR 16639 at commit 
[`0091aba`](https://github.com/apache/spark/commit/0091abacb930642a4ef2178a31be7d6b70462766).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16639: [SPARK-19276][CORE] Fetch Failure handling robust...

2017-01-18 Thread squito
GitHub user squito opened a pull request:

https://github.com/apache/spark/pull/16639

[SPARK-19276][CORE] Fetch Failure handling robust to user error handling

## What changes were proposed in this pull request?

Fault-tolerance in spark requires special handling of shuffle fetch
failures.  The Executor would catch FetchFailedException and send a
special msg back to the driver.

However, intervening user code could intercept that exception, and wrap
it with something else.  This even happens in SparkSQL.  So rather than
checking the exception directly, we'll store the fetch failure directly
in the TaskContext, where users can't touch it.

## How was this patch tested?

Added a test case which failed before the fix.  Full test suite via jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/squito/spark SPARK-19276

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16639


commit 0091abacb930642a4ef2178a31be7d6b70462766
Author: Imran Rashid 
Date:   2017-01-18T21:55:50Z

[SPARK-19276][CORE] Fetch Failure handling robust to user error handling

Fault-tolerance in spark requires special handling of shuffle fetch
failures.  The Executor would catch FetchFailedException and send a
special msg back to the driver.

However, intervening user code could intercept that exception, and wrap
it with something else.  This even happens in SparkSQL.  So rather than
checking the exception directly, we'll store the fetch failure directly
in the TaskContext, where users can't touch it.

This includes a test case which failed before the fix.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
 for PR 16633 at commit 
[`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16635
  
**[Test build #71635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71635/testReport)**
 for PR 16635 at commit 
[`ea6bd7d`](https://github.com/apache/spark/commit/ea6bd7d00c4d6ef5aea158a3fc8c3bfc5a0c02e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue:

https://github.com/apache/spark/pull/16635
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue:

https://github.com/apache/spark/pull/16635
  
@cloud-fan 
Incorporated code review comments



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16605
  
okay. But, if this issue finished, I'm planning to take SPARK-12823 in a 
similar way.
Do u think also it's not also worth trying struct? cc: @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16605
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71631/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16605
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16605
  
**[Test build #71631 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)**
 for PR 16605 at commit 
[`35715a4`](https://github.com/apache/spark/commit/35715a4b6847f56f62038e9bbd77bf4a83250410).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16635
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16635
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71630/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16605
  
Well, it will be good if we can support `Array` in `ScalaUDF`, but it's not 
a big deal as users can easily do `udf { (seq: Seq[Int]) => val a = 
seq.toArray; // do anything you like with the array }`.

considering the size of this PR, I don't think it worth.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16635
  
**[Test build #71630 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)**
 for PR 16635 at commit 
[`71be60f`](https://github.com/apache/spark/commit/71be60f38bbc18e05b90f4f4837dcda6cde2460d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16593
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16593
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71634/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16593
  
**[Test build #71634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)**
 for PR 16593 at commit 
[`14aed85`](https://github.com/apache/spark/commit/14aed85b6b3b083b8a4fdb3a3cab65f1eebc8729).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16593
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71632/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16593
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16593
  
**[Test build #71632 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)**
 for PR 16593 at commit 
[`9270851`](https://github.com/apache/spark/commit/9270851f0b358c30a14f0f63eded25b68b38b102).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16634: [SPARK-16968][SQL][Backport-2.0]Add additional op...

2017-01-18 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/16634


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16634: [SPARK-16968][SQL][Backport-2.0]Add additional options i...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16634
  
Thanks for the review, Merging to 2.0!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16605
  
Sure, @maropu . `WrappedArray` is not documented for now.

Hi, @gatorsmile and @cloud-fan .
Could you review this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf No. A simple example: if there are 5 local limit which produce 1, 2, 
1, 1, 1 rows when limit is 10. If you shuffle to 5 partitions, the 
distributions for each local limit look like:

1: (1, 0, 0, 0, 0)
2: (1, 1, 0, 0, 0)
3: (1, 0, 0, 0, 0)
4: (1, 0, 0, 0, 0)
5: (1, 0, 0, 0, 0)

So the final rows in 5 partitions are (5, 1, 0, 0, 0) which is not 
uniformly distributed.

You don't know how many rows each local limit can get. So how do you know 
how many partitions and how many rows to retrieve for each partitions?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2017-01-18 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16213
  
@tdas ping


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96790387
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
+sql("CREATE TABLE `_tbl`(i INT) USING parquet")
+sql("INSERT INTO `_tbl` VALUES (1), (2), (3)")
+checkAnswer( sql("SELECT * FROM `_tbl`"), Row(1) :: Row(2) :: Row(3) 
:: Nil)
--- End diff --

It should be fine. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16605
  
oh, yea. I didn't find that and I think it's a good point.
IMO `WrappedArray` is implicitly used inside for implicit conversions, so 
users do not use `WrappedArray` directly for UDFs in most cases.

Anyway, thanks alots for your reviews!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96790264
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
+sql("CREATE TABLE `_tbl`(i INT) USING parquet")
--- End diff --

I will do this change


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96790238
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
--- End diff --

Yes, it is not parquet only. I think SPARK-19059: read file based table 
whose name starts with underscore is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96790019
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
+sql("CREATE TABLE `_tbl`(i INT) USING parquet")
+sql("INSERT INTO `_tbl` VALUES (1), (2), (3)")
+checkAnswer( sql("SELECT * FROM `_tbl`"), Row(1) :: Row(2) :: Row(3) 
:: Nil)
--- End diff --

I think we can stop here, create a table with underscore, then insert into 
it, then read it, that's enough to prove we can support table with underscore. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
refer to the maillist
>One issue left is how to decide shuffle partition number. 
We can have a config of the maximum number of elements for each GlobalLimit 
task to process, 
then do a factorization to get a number most close to that config. 
E.g. the config is 2000: 
if limit=1,  1 = 2000 * 5, we shuffle to 5 partitions 
if limit=,  =  * 9, we shuffle to 9 partitions 
if limit is a prime number, we just fall back to single partition 

You mean for the prime number case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16605#discussion_r96789868
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -84,7 +86,9 @@ case class ScalaUDF(
 case 1 =>
   val func = function.asInstanceOf[(Any) => Any]
   val child0 = children(0)
-  lazy val converter0 = 
CatalystTypeConverters.createToScalaConverter(child0.dataType)
+  lazy val converter0 = inputConverters.map {
--- End diff --

okay, fixed!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96789821
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
+sql("CREATE TABLE `_tbl`(i INT) USING parquet")
--- End diff --

use `withTable("tbl", "_tbl") {...}`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16635#discussion_r96789777
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test(
+"SPARK-19059: Unable to retrieve data from parquet table whose name 
startswith underscore") {
--- End diff --

this bug is not parquet only right? how about `SPARK-19059: read file based 
table whose name starts with underscore`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71633 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
 for PR 16633 at commit 
[`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16593
  
**[Test build #71634 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)**
 for PR 16593 at commit 
[`14aed85`](https://github.com/apache/spark/commit/14aed85b6b3b083b8a4fdb3a3cab65f1eebc8729).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf 
> it use a special partitioner to do this, the partitioner like the 
row_numer in sql it give each row a uniform partitionid, so in the reduce task, 
each task handle num of rows very closely.

I see @wzhfy wants to use a partitioner to uniformly distribute the rows in 
each local limit. However, because each local limit can produce different 
number of rows, you can't get a real uniform distribution. So in the global 
limit operation, you can't know how many partitions you need to use in order to 
satisfy the final limit number.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16593#discussion_r96788653
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -87,8 +101,8 @@ case class CreateHiveTableAsSelectCommand(
   }
 } else {
   try {
-sparkSession.sessionState.executePlan(InsertIntoTable(
-  metastoreRelation, Map(), query, overwrite = true, ifNotExists = 
false)).toRdd
+
sparkSession.sessionState.executePlan(InsertIntoTable(metastoreRelation,
--- End diff --

oh, we should be fine here, the table is created with 
`reorderedOutputQuery.schema`, so there won't be any type difference


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
To clear, now we have these issues:
1.  local limit compute all partitions, that means it launch many tasks  
but actually maybe very small tasks is enough.
2.  global limit single partition issue, now the global limit will shuffle 
all the data to one partition, so if the limit num is very big, it cause 
performance bottleneck 

It is perfect if we combine the global limit and local limit into one 
stage, and avoid the shuffle, but for now i can not find a very good 
solution(no performance regression) to do this without change spark 
core/scheduler, your solution is trying to do that, but as i suggest, there are 
some cases the performance maybe worse.

@wzhfy 's idea is just resolve the single partition issue, still shuffle, 
still local limit on all the partitions, but it not bring performance down in 
that cases compare with current code path.

> Another issue is, how do you make sure you create a uniform distribution 
of the result of local limit. Each local limit can produce different number of 
rows.

it use a special partitioner to do this, the partitioner like the 
`row_numer`  in sql it give each row a uniform partitionid, so in the reduce 
task, each task handle num of rows very closely.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16593
  
**[Test build #71632 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)**
 for PR 16593 at commit 
[`9270851`](https://github.com/apache/spark/commit/9270851f0b358c30a14f0f63eded25b68b38b102).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16593#discussion_r96788538
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -1361,6 +1355,22 @@ class HiveDDLSuite
 }
   }
 
+  test("create hive serde table as select") {
--- End diff --

`create partitioned hive serde table as select`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71627 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)**
 for PR 16633 at commit 
[`6ba8b28`](https://github.com/apache/spark/commit/6ba8b284ec8f43a76c9ba54349438e484a097223).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode with CodegenSupport `
  * `case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16605
  
**[Test build #71631 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)**
 for PR 16605 at commit 
[`35715a4`](https://github.com/apache/spark/commit/35715a4b6847f56f62038e9bbd77bf4a83250410).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16635
  
**[Test build #71630 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)**
 for PR 16635 at commit 
[`71be60f`](https://github.com/apache/spark/commit/71be60f38bbc18e05b90f4f4837dcda6cde2460d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue:

https://github.com/apache/spark/pull/16635
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16635
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71629/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16635
  
**[Test build #71629 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)**
 for PR 16635 at commit 
[`499711d`](https://github.com/apache/spark/commit/499711d96f5f776baf482e0cbc12cd55f8c9b2c2).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16635
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16635
  
**[Test build #71629 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)**
 for PR 16635 at commit 
[`499711d`](https://github.com/apache/spark/commit/499711d96f5f776baf482e0cbc12cd55f8c9b2c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue:

https://github.com/apache/spark/pull/16635
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16552
  
The overall idea is to use `InsertIntable` to implement appending to hive 
table, but this approach is too hacky, we should follow the way how we deal 
with data source table, e.g. `DataFrameWriter.saveAsTable` just build a 
`CreateTable` plan, rule `AnalyzeCreateTable` do some checking and 
normalization, and another rule turn `CreateTable` into 
`CreateDataSourceTableAsSelectCommand`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16633: [SPARK-19274][SQL] Make GlobalLimit without shuff...

2017-01-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16633#discussion_r96786080
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -90,21 +94,74 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
 }
 
 /**
- * Take the first `limit` elements of each child partition, but do not 
collect or shuffle them.
+ * Take the first `limit` elements of the child's partitions.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
-
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
-
-  override def outputPartitioning: Partitioning = child.outputPartitioning
-}
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+  override def output: Seq[Attribute] = child.output
 
-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+  protected override def doExecute(): RDD[InternalRow] = {
+// This logic is mainly copyed from `SparkPlan.executeTake`.
+// TODO: combine this with `SparkPlan.executeTake`, if possible.
+val childRDD = child.execute()
+val totalParts = childRDD.partitions.length
+var partsScanned = 0
+var totalNum = 0
+var resultRDD: RDD[InternalRow] = null
+while (totalNum < limit && partsScanned < totalParts) {
+  // The number of partitions to try in this iteration. It is ok for 
this number to be
+  // greater than totalParts because we actually cap it at totalParts 
in runJob.
+  var numPartsToTry = 1L
+  if (partsScanned > 0) {
+// If we didn't find any rows after the previous iteration, 
quadruple and retry.
+// Otherwise, interpolate the number of partitions we need to try, 
but overestimate
+// it by 50%. We also cap the estimation in the end.
+val limitScaleUpFactor = 
Math.max(sqlContext.conf.limitScaleUpFactor, 2)
+if (totalNum == 0) {
+  numPartsToTry = partsScanned * limitScaleUpFactor
+} else {
+  // the left side of max is >=1 whenever partsScanned >= 2
+  numPartsToTry = Math.max((1.5 * limit * partsScanned / 
totalNum).toInt - partsScanned, 1)
+  numPartsToTry = Math.min(numPartsToTry, partsScanned * 
limitScaleUpFactor)
+}
+  }
 
-  override def requiredChildDistribution: List[Distribution] = AllTuples 
:: Nil
+  val p = partsScanned.until(math.min(partsScanned + numPartsToTry, 
totalParts).toInt)
+  val sc = sqlContext.sparkContext
+  val res = sc.runJob(childRDD,
+(it: Iterator[InternalRow]) => Array[Int](it.size), p)
+
+  totalNum += res.map(_.head).sum
+  partsScanned += p.size
+
+  if (totalNum >= limit) {
+// If we scan more rows than the limit number, we need to reduce 
that from scanned.
+// We calculate how many rows need to be reduced for each 
partition,
+// until all redunant rows are reduced.
+var numToReduce = (totalNum - limit)
+val reduceAmounts = new HashMap[Int, Int]()
+val partitionsToReduce = p.zip(res.map(_.head)).foreach { case 
(part, size) =>
+  val toReduce = if (size > numToReduce) numToReduce else size
+  reduceAmounts += ((part, toReduce))
+  numToReduce -= toReduce
+}
+resultRDD = childRDD.mapPartitionsWithIndexInternal { case (index, 
iter) =>
+  if (index < partsScanned) {
--- End diff --

Yes. The broken RDD job chain causes extra partition scan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/16621#discussion_r96785980
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -118,6 +118,14 @@ class SessionCatalog(
   }
 
   /**
+   * A cache of qualified table name to table relation plan.
+   */
+  val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
+// TODO: create a config instead of hardcode 1000 here.
--- End diff --

Yep. Sure~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf The main issue the user posted in the mailing list is, the limit is 
big enough or partition number is big enough to cause performance bottleneck in 
shuffling the data of local limit. But @wzhfy's idea is also involving 
shuffling.

Another issue is, how do you make sure you create a uniform distribution of 
the result of local limit. Each local limit can produce different number of 
rows.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16638: spark-19115

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16638
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16638: spark-19115

2017-01-18 Thread ouyangxiaochen
GitHub user ouyangxiaochen opened a pull request:

https://github.com/apache/spark/pull/16638

spark-19115

## What changes were proposed in this pull request?
sparksql supports the command : create external table if not exists gen_tbl 
like src_tbl location '/warehouse/gen_tbl' in spark2.X


## How was this patch tested?
 manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ouyangxiaochen/spark spark19115

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16638


commit adde008588cf8e05cf261c086201c27a8dd5584f
Author: ouyangxiaochen 
Date:   2017-01-19T03:15:17Z

spark-19115




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16621#discussion_r96785426
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -118,6 +118,14 @@ class SessionCatalog(
   }
 
   /**
+   * A cache of qualified table name to table relation plan.
+   */
+  val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
+// TODO: create a config instead of hardcode 1000 here.
--- End diff --

yea it's easy, but I wanna minimal the code changes so it's easier to 
review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16627: [SPARK-19267][SS]Fix a race condition when stoppi...

2017-01-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/16627#discussion_r96779672
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
 ---
@@ -34,6 +35,132 @@ import org.apache.spark.util.ThreadUtils
 /** Unique identifier for a [[StateStore]] */
 case class StateStoreId(checkpointLocation: String, operatorId: Long, 
partitionId: Int)
 
+/**
+ * The class to maintain [[StateStore]]s. When a SparkContext is active 
(i.e. SparkEnv.get is not
+ * null), it will run a periodic background task to do maintenance on the 
loaded stores. The
+ * background task will be cancelled when `stop` is called or 
`SparkEnv.get` becomes `null`.
+ */
+class StateStoreContext extends Logging {
+  import StateStore._
+
+  private val maintenanceTaskExecutor =
+
ThreadUtils.newDaemonSingleThreadScheduledExecutor("state-store-maintenance-task")
+
+  @GuardedBy("StateStore.LOCK")
+  private val loadedProviders = new mutable.HashMap[StateStoreId, 
StateStoreProvider]()
+
+  @GuardedBy("StateStore.LOCK")
+  private var _coordRef: StateStoreCoordinatorRef = null
+
+  @GuardedBy("StateStore.LOCK")
+  private var isStopped: Boolean = false
+
+  /** Get the state store provider or add `stateStoreProvider` if not 
exist */
+  def getOrElseUpdate(
+  storeId: StateStoreId,
+  stateStoreProvider: => StateStoreProvider): StateStoreProvider = 
LOCK.synchronized {
+loadedProviders.getOrElseUpdate(storeId, stateStoreProvider)
+  }
+
+  /** Unload a state store provider */
+  def unload(storeId: StateStoreId): Unit = LOCK.synchronized { 
loadedProviders.remove(storeId) }
+
+  /** Whether a state store provider is loaded or not */
+  def isLoaded(storeId: StateStoreId): Boolean = LOCK.synchronized {
+loadedProviders.contains(storeId)
+  }
+
+  /** Whether the maintenance task is running */
+  def isMaintenanceRunning: Boolean = LOCK.synchronized { !isStopped }
+
+  /** Unload and stop all state store providers */
+  def stop(): Unit = LOCK.synchronized {
+if (!isStopped) {
+  isStopped = true
+  loadedProviders.clear()
+  maintenanceTaskExecutor.shutdown()
+  logInfo("StateStore stopped")
+}
+  }
+
+  /** Start the periodic maintenance task if not already started and if 
Spark active */
+  private def startMaintenance(): Unit = {
+val env = SparkEnv.get
+if (env != null) {
+  val periodMs = env.conf.getTimeAsMs(
+MAINTENANCE_INTERVAL_CONFIG, 
s"${MAINTENANCE_INTERVAL_DEFAULT_SECS}s")
+  val runnable = new Runnable {
+override def run(): Unit = { doMaintenance() }
+  }
+  maintenanceTaskExecutor.scheduleAtFixedRate(
+runnable, periodMs, periodMs, TimeUnit.MILLISECONDS)
+  logInfo("State Store maintenance task started")
+}
+  }
+
+  /**
+   * Execute background maintenance task in all the loaded store providers 
if they are still
+   * the active instances according to the coordinator.
+   */
+  private def doMaintenance(): Unit = {
+logDebug("Doing maintenance")
+if (SparkEnv.get == null) {
+  stop()
+} else {
+  LOCK.synchronized { loadedProviders.toSeq }.foreach { case (id, 
provider) =>
--- End diff --

This is locking the state store while maintenance is going on. since it 
using the same lock as the external lock this, the task using the store will 
block on the maintenance task.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   >