[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55482347
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Let me respond to the original question by @cloud-fan: 
we will not see an empty Project if the child has more than one column. 
The empty Project only happens after ColumnPruning. I am fine with adding an 
extra rule dedicated to eliminating such Projects. 
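
For readers following the thread, here is a minimal sketch (assuming the Catalyst `Rule`/`transform` API; `EliminateEmptyProject` is a hypothetical name for the stand-alone rule mentioned above, not code from this PR):
```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical stand-alone rule: collapse a Project whose projectList is
// empty into its child. Such Projects only appear after column pruning,
// so this would be a cleanup step rather than pruning itself.
object EliminateEmptyProject extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if projectList.isEmpty => child
  }
}
```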






[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194167614
  
@cloud-fan Thanks! Actually, I remember exploring this option. There is a 
generate syntax for an outer lateral view, like the following:
```SQL
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a LIMIT 10;
```
I thought that if we converted this Generate operator into a sub-select where the 
generator is in the projection list, we would lose the OUTER semantics. That's why 
I felt it was safer to generate a LATERAL VIEW clause.
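
For context, a minimal sketch (assuming a `sqlContext` with a registered table `src`; the sub-select rewrite shown is illustrative only, not the SQL this PR generates) of why the OUTER semantics matter when the generator produces no rows:
```scala
// With LATERAL VIEW OUTER, each row of src is kept and the generated column
// is NULL when explode(array()) produces nothing.
val keepsRows = sqlContext.sql(
  "SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a LIMIT 10")

// Moving the generator into a sub-select's projection list drops the OUTER
// behavior: explode(array()) yields zero rows per input row, so this query
// returns no rows at all.
val losesRows = sqlContext.sql(
  "SELECT t.a FROM (SELECT explode(array()) AS a FROM src) t LIMIT 10")
```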






[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55482082
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Thanks @viirya @cloud-fan !

I am not sure which way is better. 
```scala
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p 
```
My concern is that the above line looks hackier than the current PR fix. 






[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-08 Thread olarayej
Github user olarayej commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r55481841
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
--- End diff --

Thanks @felixcheung for investigating this further!





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481794
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Yea. As I posted before.





[GitHub] spark pull request: Update Java Doc in Spark Submit

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11600#issuecomment-194164345
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481593
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Without the fix, I hit the error:

```
== FAIL: Plans do not match ===
 Project [1 AS 1#0]  Project [1 AS 1#0]
!+- Project  +- OneRowRelation$
!   +- OneRowRelation$   
 
ScalaTestFailureLocation: org.apache.spark.sql.catalyst.plans.PlanTest at 
(PlanTest.scala:59)
org.scalatest.exceptions.TestFailedException: 
== FAIL: Plans do not match ===
 Project [1 AS 1#0]  Project [1 AS 1#0]
!+- Project  +- OneRowRelation$
!   +- OneRowRelation$   
```
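
For reference, a test of roughly this shape (a hypothetical sketch in the Catalyst test DSL; `Optimize` stands in for a local `RuleExecutor` and `comparePlans` comes from `PlanTest`, so this is not the actual suite in this PR) produces a comparison like the one above:
```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.OneRowRelation

// An empty .select() leaves a Project with an empty projectList over
// OneRowRelation; without the new case the optimized plan keeps it.
val query     = OneRowRelation.select().select(Literal(1).as("1"))
val optimized = Optimize.execute(query.analyze)
val answer    = OneRowRelation.select(Literal(1).as("1")).analyze
comparePlans(optimized, answer)  // fails without the empty-Project case
```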





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481539
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Never mind. It is because of another rule I added.






[GitHub] spark pull request: Update Java Doc in Spark Submit

2016-03-08 Thread AhmedKamal
GitHub user AhmedKamal opened a pull request:

https://github.com/apache/spark/pull/11600

Update Java Doc in Spark Submit

## What changes were proposed in this pull request?

JIRA : https://issues.apache.org/jira/browse/SPARK-13769

The java doc here 
(https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
needs to be updated from "The latter two operations are currently supported 
only for standalone cluster mode." to "The latter two operations are currently 
supported only for standalone and mesos cluster modes."



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AhmedKamal/spark SPARK-13769

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11600.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11600


commit 104b5059c315425ce0861324005f6b3d86f7cd3c
Author: Ahmed Kamal 
Date:   2016-03-09T07:41:16Z

Update Java Doc in Spark Submit







[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481375
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Actually, I just tried to run these tests without this patch, and they pass. 
@gatorsmile Can you verify it?





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481341
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

How about this?
```scala
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p
```





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55481255
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

```scala
case p @ Project(_, l: LeafNode) => p
```

There is another case above it in the rule, so pattern matching stops at that case.
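
To make the match-order point explicit, a hedged sketch (hypothetical rule, not the actual ColumnPruning code) showing that the earlier leaf-node case wins before a later empty-projectList case could fire:
```scala
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule body illustrating match order only: the first case that
// matches wins, so for a Project over a leaf (e.g. OneRowRelation) the
// earlier leaf-node case fires and the empty-projectList case is never tried.
object MatchOrderSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p @ Project(_, _: LeafNode) => p
    case Project(projectList, child) if projectList.isEmpty => child
  }
}
```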







[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55480958
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

But a `Project` with an empty projectList also has no output, right?





[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...

2016-03-08 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11547#issuecomment-194153088
  
Made a few minor comments, pending those LGTM.





[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194153085
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52725/
Test PASSed.





[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194153084
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...

2016-03-08 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11547#discussion_r55480608
  
--- Diff: examples/src/main/python/ml/train_validation_split.py ---
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import SparkContext
+# $example on$
+from pyspark.ml import Pipeline
+from pyspark.ml.evaluation import RegressionEvaluator
+from pyspark.ml.regression import LinearRegression
+from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
+from pyspark.sql import SQLContext
+# $example off$
+
+"""
+This example demonstrats applying TrainValidationSplit to split data
+and preform model selection, as well as applying Pipelines.
+Run with:
+
+  bin/spark-submit examples/src/main/python/ml/train_validation_split.py
+"""
+
+if __name__ == "__main__":
+sc = SparkContext(appName="TrainValidationSplit")
+sqlContext = SQLContext(sc)
+# $example on$
+# Prepare training and test data.
+data = sqlContext.read.format("libsvm")\
+.load("data/mllib/sample_linear_regression_data.txt")
+train, test = data.randomSplit([0.7, 0.3])
+lr = LinearRegression(maxIter=10, regParam=0.1)
+
+# We use a ParamGridBuilder to construct a grid of parameters to search over.
+# TrainValidationSplit will try all combinations of values and determine best model using
+# the evaluator.
+paramGrid = ParamGridBuilder()\
+.addGrid(lr.regParam, [0.1, 0.01]) \
+.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
+.build()
+
+# In this case the estimator is simply the linear regression.
+# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
+tvs = TrainValidationSplit(estimator=lr,
+   estimatorParamMaps=paramGrid,
+   evaluator=RegressionEvaluator(),
+   # 80% of the data will be used for training, 20% for validation.
+   trainRatio=0.8)
+
+# Run TrainValidationSplit, chosing the set of parameters that optimizes the evaluator.
--- End diff --

Perhaps we could change this comment to something like `Running 
TrainValidationSplit returns the model with the combination of parameters that 
performed best on the validation set.`





[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...

2016-03-08 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11547#discussion_r55480665
  
--- Diff: examples/src/main/python/ml/train_validation_split.py ---
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import SparkContext
+# $example on$
+from pyspark.ml import Pipeline
+from pyspark.ml.evaluation import RegressionEvaluator
+from pyspark.ml.regression import LinearRegression
+from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
+from pyspark.sql import SQLContext
+# $example off$
+
+"""
+This example demonstrats applying TrainValidationSplit to split data
+and preform model selection, as well as applying Pipelines.
+Run with:
+
+  bin/spark-submit examples/src/main/python/ml/train_validation_split.py
+"""
+
+if __name__ == "__main__":
+sc = SparkContext(appName="TrainValidationSplit")
+sqlContext = SQLContext(sc)
+# $example on$
+# Prepare training and test data.
+data = sqlContext.read.format("libsvm")\
+.load("data/mllib/sample_linear_regression_data.txt")
+train, test = data.randomSplit([0.7, 0.3])
+lr = LinearRegression(maxIter=10, regParam=0.1)
+
+# We use a ParamGridBuilder to construct a grid of parameters to search over.
+# TrainValidationSplit will try all combinations of values and determine best model using
+# the evaluator.
+paramGrid = ParamGridBuilder()\
+.addGrid(lr.regParam, [0.1, 0.01]) \
+.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
+.build()
+
+# In this case the estimator is simply the linear regression.
+# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
+tvs = TrainValidationSplit(estimator=lr,
+   estimatorParamMaps=paramGrid,
+   evaluator=RegressionEvaluator(),
+   # 80% of the data will be used for training, 20% for validation.
+   trainRatio=0.8)
+
+# Run TrainValidationSplit, chosing the set of parameters that optimizes the evaluator.
+model = tvs.fit(train)
+# Make predictions on test data. model is the model with combination of parameters
--- End diff --

... and here we can then simply have `Make predictions on test data using 
the model`





[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194152932
  
**[Test build #52725 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52725/consoleFull)**
 for PR 11596 at commit 
[`d7f0207`](https://github.com/apache/spark/commit/d7f020701b3b063beb00293af3ff4149211926ab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55480611
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Because `OneRowRelation` has no output, its output differs from that of its 
parent `Project`.





[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194152170
  
I'm not sure which one is better; it depends on which one is easier to 
understand and implement. I'm OK with generating a verbose SQL string, but the 
generator itself should be as simple and robust as possible.





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55480282
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
 p
   }
 
+// Eliminate the Projects with empty projectList
+case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

I'm thinking about the correctness of this rule. Actually, this is not column 
pruning but adding more columns, as `child` may have more than one column.

And why can't this rule, `case p @ Project(projectList, child) if 
sameOutput(child.output, p.output) => child`, work?





[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...

2016-03-08 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11547#discussion_r55479803
  
--- Diff: examples/src/main/python/ml/train_validation_split.py ---
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import SparkContext
+# $example on$
+from pyspark.ml import Pipeline
+from pyspark.ml.evaluation import RegressionEvaluator
+from pyspark.ml.regression import LinearRegression
+from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
+from pyspark.sql import SQLContext
+# $example off$
+
+"""
+This example demonstrats applying TrainValidationSplit to split data
--- End diff --

Typo: `demonstrats`





[GitHub] spark pull request: [SPARK-13636][SQL] Directly consume UnsafeRow ...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11484#issuecomment-194146675
  
**[Test build #52733 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52733/consoleFull)**
 for PR 11484 at commit 
[`dea644a`](https://github.com/apache/spark/commit/dea644a74251c62f1ccb0fd095083d434acd1a8c).





[GitHub] spark pull request: [SPARK-13636][SQL] Directly consume UnsafeRow ...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11484#issuecomment-194145448
  
**[Test build #52732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52732/consoleFull)**
 for PR 11484 at commit 
[`6400eb2`](https://github.com/apache/spark/commit/6400eb22aefa986fb5d96d3d6d0242778e2f332a).





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194143635
  
@cloud-fan Added another two cases. Feel free to let me know if you want me 
to add more cases. Thanks!





[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11594#issuecomment-194142858
  
**[Test build #52731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52731/consoleFull)**
 for PR 11594 at commit 
[`84bac4d`](https://github.com/apache/spark/commit/84bac4d7309121c0fd699ea2b06aa05b775a4d81).





[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11594#issuecomment-194141605
  
**[Test build #52730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52730/consoleFull)**
 for PR 11594 at commit 
[`54c3f49`](https://github.com/apache/spark/commit/54c3f49f6f558b77d8bd805269ed1ca8b8be3fad).





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194141048
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52724/
Test PASSed.





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194141046
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194140806
  
**[Test build #52724 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52724/consoleFull)**
 for PR 11599 at commit 
[`a31b1b5`](https://github.com/apache/spark/commit/a31b1b588949f2f92981f7d1a7d04d6e1806ccd1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55477889
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
+case -1 =>
+  serializer match {
+case SerializationFormats.STRING => readStringData _
+case _ => readByteArrayData _
+  }
+case _ => readShuffledData _
+  }
+
+  def compute(
+  inputIterator: Iterator[_],
+  partitionIndex: Int,
+  context: TaskContext): Iterator[U] = {
+// Timing start
+bootTime = System.currentTimeMillis / 1000.0
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2, 
InetAddress.getByName("localhost"))
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one 
daemon
+// to launch child process as worker.
+val errThread = RRunner.createRWorker(listenPort)
+
+// We use two sockets to separate input and output, then it's easy to 
manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(1)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), inputIterator, 
partitionIndex)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+dataStream = new DataInputStream(inputStream)
+serverSocket.close()
+
+try {
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + 
errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread(
+  output: OutputStream,
+  iter: Iterator[_],
+  partitionIndex: Int): Unit = {
+val env = SparkEnv.get
+val taskContext = TaskContext.get()
+val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+val stream = new BufferedOutputStream(output, bufferSize)
+
+new Thread("writer for R") {
+  override def run(): Unit = {
+try {
+  SparkEnv.set(env)
+  TaskContext.setTaskContext(taskContext)
+  val dataOut = new DataOutputStream(stream)
+  dataOut.writeInt(partitionIndex)
+
+  SerDe.writeString(dataOut, deserializer)
+  SerDe.writeString(dataOut, serializer)

[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55477830
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
--- End diff --

Lines 44 to 51 are new code.

`readData` is a val holding a function for reading data returned from the R 
worker. It is one of the following three functions: 
`readShuffledData()`, `readByteArrayData()`, or `readStringData()`.

According to the constructor parameters, we decide which function is used.
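
As a standalone illustration (hypothetical names, not the RRunner code itself) of the dispatch described above, a val can pick one of three reader functions once, based on constructor parameters, so callers just invoke `readData()` without branching again:
```scala
// Hypothetical sketch: choose the reader function at construction time.
class ReaderSketch(numPartitions: Int, serializer: String) {
  private def readShuffledData(): String  = "shuffled"
  private def readByteArrayData(): String = "bytes"
  private def readStringData(): String    = "string"

  // Selected once; the per-record read path needs no further branching.
  val readData: () => String = numPartitions match {
    case -1 =>
      serializer match {
        case "string" => readStringData _
        case _        => readByteArrayData _
      }
    case _ => readShuffledData _
  }
}
```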





[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55477700
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
+case -1 =>
+  serializer match {
+case SerializationFormats.STRING => readStringData _
+case _ => readByteArrayData _
+  }
+case _ => readShuffledData _
+  }
+
+  def compute(
+  inputIterator: Iterator[_],
+  partitionIndex: Int,
+  context: TaskContext): Iterator[U] = {
+// Timing start
+bootTime = System.currentTimeMillis / 1000.0
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2, 
InetAddress.getByName("localhost"))
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one 
daemon
+// to launch child process as worker.
+val errThread = RRunner.createRWorker(listenPort)
+
+// We use two sockets to separate input and output, then it's easy to 
manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(1)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), inputIterator, 
partitionIndex)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+dataStream = new DataInputStream(inputStream)
+serverSocket.close()
+
+try {
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + 
errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread(
+  output: OutputStream,
+  iter: Iterator[_],
+  partitionIndex: Int): Unit = {
+val env = SparkEnv.get
+val taskContext = TaskContext.get()
+val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+val stream = new BufferedOutputStream(output, bufferSize)
+
+new Thread("writer for R") {
+  override def run(): Unit = {
+try {
+  SparkEnv.set(env)
+  TaskContext.setTaskContext(taskContext)
+  val dataOut = new DataOutputStream(stream)
+  dataOut.writeInt(partitionIndex)
+
+  SerDe.writeString(dataOut, deserializer)
+  SerDe.writeString(dataOut, serializer)

[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10821#issuecomment-194140043
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11594#issuecomment-194140033
  
**[Test build #52729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52729/consoleFull)**
 for PR 11594 at commit 
[`635eae2`](https://github.com/apache/spark/commit/635eae217f293af248a68cafca7ba6187c2a1886).





[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10821#issuecomment-194140046
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52722/
Test PASSed.





[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55477584
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
+case -1 =>
+  serializer match {
+case SerializationFormats.STRING => readStringData _
+case _ => readByteArrayData _
+  }
+case _ => readShuffledData _
+  }
+
+  def compute(
+  inputIterator: Iterator[_],
+  partitionIndex: Int,
+  context: TaskContext): Iterator[U] = {
+// Timing start
+bootTime = System.currentTimeMillis / 1000.0
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2, 
InetAddress.getByName("localhost"))
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one 
daemon
+// to launch child process as worker.
+val errThread = RRunner.createRWorker(listenPort)
+
+// We use two sockets to separate input and output, then it's easy to 
manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(1)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), inputIterator, 
partitionIndex)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+dataStream = new DataInputStream(inputStream)
+serverSocket.close()
+
+try {
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + 
errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread(
+  output: OutputStream,
+  iter: Iterator[_],
+  partitionIndex: Int): Unit = {
+val env = SparkEnv.get
+val taskContext = TaskContext.get()
+val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+val stream = new BufferedOutputStream(output, bufferSize)
+
+new Thread("writer for R") {
+  override def run(): Unit = {
+try {
+  SparkEnv.set(env)
+  TaskContext.setTaskContext(taskContext)
+  val dataOut = new DataOutputStream(stream)
+  dataOut.writeInt(partitionIndex)
+
+  SerDe.writeString(dataOut, deserializer)
+  SerDe.writeString(dataOut, serializer)

[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10821#issuecomment-194139901
  
**[Test build #52722 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52722/consoleFull)**
 for PR 10821 at commit 
[`c98c3b5`](https://github.com/apache/spark/commit/c98c3b5b28c834c9652f178b1669e2af977e6123).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194139575
  
@cloud-fan Yeah, that should be possible. Are you thinking of representing 
each Generate as a sub-select clause?





[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194139422
  
**[Test build #52728 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52728/consoleFull)**
 for PR 11555 at commit 
[`dab7a2f`](https://github.com/apache/spark/commit/dab7a2f1a5cc0438405b0fa1cf532ab883bed7e7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13320] [SQL] Support Star in CreateStru...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11208#issuecomment-194138373
  
**[Test build #52727 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52727/consoleFull)**
 for PR 11208 at commit 
[`e060dea`](https://github.com/apache/spark/commit/e060deaaf09d122966f090bf3b86895636418664).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13320] [SQL] Support Star in CreateStru...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/11208#issuecomment-194136402
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55476565
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
+case -1 =>
+  serializer match {
+case SerializationFormats.STRING => readStringData _
+case _ => readByteArrayData _
+  }
+case _ => readShuffledData _
+  }
+
+  def compute(
+  inputIterator: Iterator[_],
+  partitionIndex: Int,
+  context: TaskContext): Iterator[U] = {
+// Timing start
+bootTime = System.currentTimeMillis / 1000.0
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost"))
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one daemon
+// to launch child process as worker.
+val errThread = RRunner.createRWorker(listenPort)
+
+// We use two sockets to separate input and output, then it's easy to manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(1)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+dataStream = new DataInputStream(inputStream)
+serverSocket.close()
+
+try {
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread(
+  output: OutputStream,
+  iter: Iterator[_],
+  partitionIndex: Int): Unit = {
+val env = SparkEnv.get
+val taskContext = TaskContext.get()
+val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+val stream = new BufferedOutputStream(output, bufferSize)
+
+new Thread("writer for R") {
+  override def run(): Unit = {
+try {
+  SparkEnv.set(env)
+  TaskContext.setTaskContext(taskContext)
+  val dataOut = new DataOutputStream(stream)
+  dataOut.writeInt(partitionIndex)
+
+  SerDe.writeString(dataOut, deserializer)
+  SerDe.writeString(dataOut, serializer)

[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194133518
  
But a `Generate` operator only has one `generator`, so I think we can 
convert any kind of `Generate` to SELECT?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10947#issuecomment-194133921
  
cc @davies


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55476210
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -789,117 +752,193 @@ private[spark] class BlockManager(
   // lockNewBlockForWriting returned a read lock on the existing 
block, so we must free it:
   releaseLock(blockId)
 }
-return DoPutSucceeded
+return true
   }
 }
 
 val startTimeMs = System.currentTimeMillis
 
-// Size of the block in bytes
-var size = 0L
-
-// The level we actually use to put the block
-val putLevel = effectiveStorageLevel.getOrElse(level)
-
-// If we're storing bytes, then initiate the replication before 
storing them locally.
+// Since we're storing bytes, initiate the replication before storing 
them locally.
 // This is faster as data is already serialized and ready to send.
-val replicationFuture = data match {
-  case b: ByteBufferValues if putLevel.replication > 1 =>
-// Duplicate doesn't copy the bytes, but just creates a wrapper
-val bufferView = b.buffer.duplicate()
-Future {
-  // This is a blocking action and should run in 
futureExecutionContext which is a cached
-  // thread pool
-  replicate(blockId, bufferView, putLevel)
-}(futureExecutionContext)
-  case _ => null
+val replicationFuture = if (level.replication > 1) {
+  // Duplicate doesn't copy the bytes, but just creates a wrapper
+  val bufferView = bytes.duplicate()
+  Future {
+// This is a blocking action and should run in 
futureExecutionContext which is a cached
+// thread pool
+replicate(blockId, bufferView, level)
+  }(futureExecutionContext)
+} else {
+  null
 }
 
 var blockWasSuccessfullyStored = false
-var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None
-
-putBlockInfo.synchronized {
-  logTrace("Put for block %s took %s to get into synchronized block"
-.format(blockId, Utils.getUsedTimeMs(startTimeMs)))
 
-  try {
-if (putLevel.useMemory) {
-  // Put it in memory first, even if it also has useDisk set to 
true;
-  // We will drop it to disk later if the memory store can't hold 
it.
-  data match {
-case IteratorValues(iterator) =>
-  memoryStore.putIterator(blockId, iterator(), putLevel) match 
{
-case Right(s) =>
-  size = s
-case Left(iter) =>
-  iteratorFromFailedMemoryStorePut = Some(iter)
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  memoryStore.putBytes(blockId, bytes, putLevel)
-  }
-} else if (putLevel.useDisk) {
-  data match {
-case IteratorValues(iterator) =>
-  diskStore.putIterator(blockId, iterator(), putLevel) match {
-case Right(s) =>
-  size = s
-// putIterator() will never return Left (see its return 
type).
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  diskStore.putBytes(blockId, bytes, putLevel)
-  }
+bytes.rewind()
+val size = bytes.limit()
+
+try {
+  if (level.useMemory) {
+// Put it in memory first, even if it also has useDisk set to true;
+// We will drop it to disk later if the memory store can't hold it.
+val putSucceeded = if (level.deserialized) {
+  val values = dataDeserialize(blockId, bytes.duplicate())
+  memoryStore.putIterator(blockId, values, level).isRight
 } else {
-  assert(putLevel == StorageLevel.NONE)
-  throw new BlockException(
-blockId, s"Attempted to put block $blockId without specifying 
storage level!")
+  memoryStore.putBytes(blockId, size, () => bytes)
+}
+if (!putSucceeded && level.useDisk) {
+  logWarning(s"Persisting block $blockId to disk instead.")
+  diskStore.putBytes(blockId, bytes)
 }
+  } else if (level.useDisk) {
+diskStore.putBytes(blockId, bytes)
+  }
 
-val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
-blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
-if (blockWasSuccessfullyStored) {
-
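Since the hunk above is long, here is a condensed, self-contained sketch of the write path it moves to: try memory first (as deserialized values or as raw bytes), spill to disk only if memory cannot hold the block, otherwise go straight to disk. The `Level`, `ToyMemoryStore`, `ToyDiskStore`, and `putBytesSketch` names are illustrative stand-ins for this sketch, not the actual `BlockManager`/`MemoryStore`/`DiskStore` APIs.

```scala
import java.nio.ByteBuffer

object PutBytesSketch {
  // Simplified stand-ins for StorageLevel and the two stores.
  case class Level(useMemory: Boolean, useDisk: Boolean, deserialized: Boolean)

  trait ToyMemoryStore {
    // Both return true if the block fit in memory.
    def putBytes(blockId: String, bytes: ByteBuffer): Boolean
    def putIterator(blockId: String, values: Iterator[Any]): Boolean
  }

  trait ToyDiskStore {
    def putBytes(blockId: String, bytes: ByteBuffer): Unit
  }

  def putBytesSketch(
      blockId: String,
      bytes: ByteBuffer,
      level: Level,
      memoryStore: ToyMemoryStore,
      diskStore: ToyDiskStore,
      deserialize: ByteBuffer => Iterator[Any]): Unit = {
    bytes.rewind()
    if (level.useMemory) {
      // Memory first, either as deserialized values or as the raw bytes.
      val putSucceeded =
        if (level.deserialized) {
          memoryStore.putIterator(blockId, deserialize(bytes.duplicate()))
        } else {
          memoryStore.putBytes(blockId, bytes)
        }
      // Fall back to disk only when memory could not hold the block.
      if (!putSucceeded && level.useDisk) {
        diskStore.putBytes(blockId, bytes)
      }
    } else if (level.useDisk) {
      diskStore.putBytes(blockId, bytes)
    }
  }
}
```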

[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55476161
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -789,117 +752,193 @@ private[spark] class BlockManager(
   // lockNewBlockForWriting returned a read lock on the existing 
block, so we must free it:
   releaseLock(blockId)
 }
-return DoPutSucceeded
+return true
   }
 }
 
 val startTimeMs = System.currentTimeMillis
 
-// Size of the block in bytes
-var size = 0L
-
-// The level we actually use to put the block
-val putLevel = effectiveStorageLevel.getOrElse(level)
-
-// If we're storing bytes, then initiate the replication before 
storing them locally.
+// Since we're storing bytes, initiate the replication before storing 
them locally.
 // This is faster as data is already serialized and ready to send.
-val replicationFuture = data match {
-  case b: ByteBufferValues if putLevel.replication > 1 =>
-// Duplicate doesn't copy the bytes, but just creates a wrapper
-val bufferView = b.buffer.duplicate()
-Future {
-  // This is a blocking action and should run in 
futureExecutionContext which is a cached
-  // thread pool
-  replicate(blockId, bufferView, putLevel)
-}(futureExecutionContext)
-  case _ => null
+val replicationFuture = if (level.replication > 1) {
+  // Duplicate doesn't copy the bytes, but just creates a wrapper
+  val bufferView = bytes.duplicate()
+  Future {
+// This is a blocking action and should run in 
futureExecutionContext which is a cached
+// thread pool
+replicate(blockId, bufferView, level)
+  }(futureExecutionContext)
+} else {
+  null
 }
 
 var blockWasSuccessfullyStored = false
-var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None
-
-putBlockInfo.synchronized {
-  logTrace("Put for block %s took %s to get into synchronized block"
-.format(blockId, Utils.getUsedTimeMs(startTimeMs)))
 
-  try {
-if (putLevel.useMemory) {
-  // Put it in memory first, even if it also has useDisk set to 
true;
-  // We will drop it to disk later if the memory store can't hold 
it.
-  data match {
-case IteratorValues(iterator) =>
-  memoryStore.putIterator(blockId, iterator(), putLevel) match 
{
-case Right(s) =>
-  size = s
-case Left(iter) =>
-  iteratorFromFailedMemoryStorePut = Some(iter)
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  memoryStore.putBytes(blockId, bytes, putLevel)
-  }
-} else if (putLevel.useDisk) {
-  data match {
-case IteratorValues(iterator) =>
-  diskStore.putIterator(blockId, iterator(), putLevel) match {
-case Right(s) =>
-  size = s
-// putIterator() will never return Left (see its return 
type).
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  diskStore.putBytes(blockId, bytes, putLevel)
-  }
+bytes.rewind()
+val size = bytes.limit()
+
+try {
+  if (level.useMemory) {
+// Put it in memory first, even if it also has useDisk set to true;
+// We will drop it to disk later if the memory store can't hold it.
+val putSucceeded = if (level.deserialized) {
+  val values = dataDeserialize(blockId, bytes.duplicate())
+  memoryStore.putIterator(blockId, values, level).isRight
 } else {
-  assert(putLevel == StorageLevel.NONE)
-  throw new BlockException(
-blockId, s"Attempted to put block $blockId without specifying 
storage level!")
+  memoryStore.putBytes(blockId, size, () => bytes)
+}
+if (!putSucceeded && level.useDisk) {
+  logWarning(s"Persisting block $blockId to disk instead.")
+  diskStore.putBytes(blockId, bytes)
 }
+  } else if (level.useDisk) {
+diskStore.putBytes(blockId, bytes)
+  }
 
-val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
-blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
-if (blockWasSuccessfullyStored) {
-

[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10947#discussion_r55476044
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ---
@@ -0,0 +1,367 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io._
+import java.net.{InetAddress, ServerSocket}
+import java.util.Arrays
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.util.Utils
+
+/**
+ * A helper class to run R UDFs in Spark.
+ */
+private[spark] class RRunner[U](
+func: Array[Byte],
+deserializer: String,
+serializer: String,
+packageNames: Array[Byte],
+broadcastVars: Array[Broadcast[Object]],
+numPartitions: Int = -1)
+  extends Logging {
+  private var bootTime: Double = _
+  private var dataStream: DataInputStream = _
+  val readData = numPartitions match {
+case -1 =>
+  serializer match {
+case SerializationFormats.STRING => readStringData _
+case _ => readByteArrayData _
+  }
+case _ => readShuffledData _
+  }
+
+  def compute(
+  inputIterator: Iterator[_],
+  partitionIndex: Int,
+  context: TaskContext): Iterator[U] = {
+// Timing start
+bootTime = System.currentTimeMillis / 1000.0
+
+// we expect two connections
+val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost"))
+val listenPort = serverSocket.getLocalPort()
+
+// The stdout/stderr is shared by multiple tasks, because we use one daemon
+// to launch child process as worker.
+val errThread = RRunner.createRWorker(listenPort)
+
+// We use two sockets to separate input and output, then it's easy to manage
+// the lifecycle of them to avoid deadlock.
+// TODO: optimize it to use one socket
+
+// the socket used to send out the input of task
+serverSocket.setSoTimeout(1)
+val inSocket = serverSocket.accept()
+startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex)
+
+// the socket used to receive the output of task
+val outSocket = serverSocket.accept()
+val inputStream = new BufferedInputStream(outSocket.getInputStream)
+dataStream = new DataInputStream(inputStream)
+serverSocket.close()
+
+try {
+  return new Iterator[U] {
+def next(): U = {
+  val obj = _nextObj
+  if (hasNext) {
+_nextObj = read()
+  }
+  obj
+}
+
+var _nextObj = read()
+
+def hasNext(): Boolean = {
+  val hasMore = (_nextObj != null)
+  if (!hasMore) {
+dataStream.close()
+  }
+  hasMore
+}
+  }
+} catch {
+  case e: Exception =>
+throw new SparkException("R computation failed with\n " + errThread.getLines())
+}
+  }
+
+  /**
+   * Start a thread to write RDD data to the R process.
+   */
+  private def startStdinThread(
+  output: OutputStream,
+  iter: Iterator[_],
+  partitionIndex: Int): Unit = {
+val env = SparkEnv.get
+val taskContext = TaskContext.get()
+val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt
+val stream = new BufferedOutputStream(output, bufferSize)
+
+new Thread("writer for R") {
+  override def run(): Unit = {
+try {
+  SparkEnv.set(env)
+  TaskContext.setTaskContext(taskContext)
+  val dataOut = new DataOutputStream(stream)
+  dataOut.writeInt(partitionIndex)
+
+  SerDe.writeString(dataOut, deserializer)
+  SerDe.writeString(dataOut, serializer)

[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55475930
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala 
---
@@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
   "f1", "b[0].f1", "f1", "c[foo]", "d[0]"
 )
   }
+
+  test("SQL generator for explode in projection list") {
+// Basic Explode
+checkHiveQl("SELECT explode(array(1,2,3)) FROM src")
+
+// Explode with Alias
+checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src")
+
+// Explode without FROM
+checkHiveQl("select explode(array(1,2,3)) AS gencol")
+
+// non-generated columns in projection list
+checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as 
c3 FROM t1")
+  }
+
+  test("SQL generation for json_tuple as generator") {
+checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 
'f5') FROM parquet_t3")
+  }
+
+  test("SQL generation for lateral views") {
+// Filter and OUTER clause
+checkHiveQl(
+  """
+|SELECT key, value
+|FROM t1
+|LATERAL VIEW OUTER explode(value) gentab as gencol
+|WHERE key = 1
+  """.stripMargin
+)
+
+// single lateral view
+checkHiveQl(
+  """
+|SELECT *
+|FROM t1
+|LATERAL VIEW explode(array(1,2,3)) gentab AS gencol
+|SORT BY key ASC, gencol ASC LIMIT 1
+  """.stripMargin
+)
+
+// multiple lateral views
+checkHiveQl(
+  """
+|SELECT gentab2.*
+|FROM t1
+|LATERAL VIEW explode(array(array(1,2,3))) gentab1 AS gencol1
+|LATERAL VIEW explode(gentab1.gencol1) gentab2 AS gencol2 LIMIT 3
+  """.stripMargin
+)
+
+// No generated column aliases
+checkHiveQl(
+  """SELECT gentab.*
+|FROM t1
+|LATERAL VIEW explode(map('key1', 100, 'key2', 200)) gentab limit 2
+  """.stripMargin
+)
+  }
+
+  test("SQL generation for lateral views in subquery") {
+// Subquries in FROM clause using Generate
+checkHiveQl(
+  """
+|SELECT subq.gencol
+|FROM
+|(SELECT * from t1 LATERAL VIEW explode(value) gentab AS gencol) subq
+  """.stripMargin)
+
+checkHiveQl(
+  """
+|SELECT subq.key
+|FROM
+|(SELECT key, value from t1 LATERAL VIEW explode(value) gentab AS gencol) subq
+  """.stripMargin
+)
+  }
+
+  test("SQL generation for UDTF") {
--- End diff --

OK. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55475917
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala ---
@@ -297,6 +303,60 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: 
SQLContext) extends Loggi
 )
   }
 
+  /* This function handles the SQL generation when generators.
--- End diff --

Will do. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55475944
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala 
---
@@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
   "f1", "b[0].f1", "f1", "c[foo]", "d[0]"
 )
   }
+
+  test("SQL generator for explode in projection list") {
+// Basic Explode
+checkHiveQl("SELECT explode(array(1,2,3)) FROM src")
+
+// Explode with Alias
+checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src")
+
+// Explode without FROM
+checkHiveQl("select explode(array(1,2,3)) AS gencol")
+
+// non-generated columns in projection list
+checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as 
c3 FROM t1")
+  }
+
+  test("SQL generation for json_tuple as generator") {
+checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 
'f5') FROM parquet_t3")
+  }
+
+  test("SQL generation for lateral views") {
--- End diff --

OK.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55475915
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala ---
@@ -94,6 +94,12 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: 
SQLContext) extends Loggi
 case Distinct(p: Project) =>
   projectToSQL(p, isDistinct = true)
 
+case p@ Project(_, g: Generate) =>
--- End diff --

Will do. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread dilipbiswal
Github user dilipbiswal commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194130778
  
@cloud-fan Thank you.
1. Can any Generate operator be converted to a LATERAL VIEW? Does the plan tree have to match some kind of pattern?

Yes, based on the tests I have tried. I am testing this more to see if I uncover any issue. Now I am converting
```SQL
select explode(array(1,2,3)) from t1
```
to
```SQL
select tablequalifier.colname from t1 LATERAL VIEW explode(array(1,2,3)) tablequalifier as colname
```
2. Can any Generate operator be converted to a "SELECT ...udtf..." format?

No, because we have a restriction of allowing only one generator in a projection list. That's why we try to use the LATERAL VIEW clause.
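
For illustration, a minimal sketch of that rewrite as plain string building (the `GenerateInfo` and `lateralViewSQL` names are made up for this example; this is not the SQLBuilder code in the PR). It also shows where an OUTER generator would need to keep the OUTER keyword:

```scala
object LateralViewSketch {
  // Illustrative description of a generator in the plan; not Spark's Generate node.
  case class GenerateInfo(
      generatorSQL: String,       // e.g. "explode(array(1,2,3))"
      outer: Boolean,             // whether OUTER semantics must be preserved
      tableAlias: String,         // e.g. "gentab"
      columnAliases: Seq[String]) // e.g. Seq("gencol")

  def lateralViewSQL(childSQL: String, projection: Seq[String], g: GenerateInfo): String = {
    val outer = if (g.outer) "OUTER " else ""
    s"SELECT ${projection.mkString(", ")} FROM $childSQL " +
      s"LATERAL VIEW $outer${g.generatorSQL} ${g.tableAlias} AS ${g.columnAliases.mkString(", ")}"
  }

  def main(args: Array[String]): Unit = {
    val g = GenerateInfo("explode(array(1,2,3))", outer = false, "gentab", Seq("gencol"))
    // Prints: SELECT gentab.gencol FROM t1 LATERAL VIEW explode(array(1,2,3)) gentab AS gencol
    println(lateralViewSQL("t1", Seq("gentab.gencol"), g))
  }
}
```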


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11517#issuecomment-194128908
  
**[Test build #52726 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52726/consoleFull)**
 for PR 11517 at commit 
[`6144810`](https://github.com/apache/spark/commit/6144810614245912f659c94f5341dd351b08e89e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...

2016-03-08 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/11517#issuecomment-194128858
  
cc @davies @nongli 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11517#discussion_r55475400
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PartitionwiseSampledRDDSuite.scala ---
@@ -29,6 +29,8 @@ class MockSampler extends RandomSampler[Long, Long] {
 s = seed
   }
 
+  override def sample(): Int = 1
--- End diff --

A change needed for the new `RandomSampler` API, also in #11578.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...

2016-03-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/11517#discussion_r55475367
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala ---
@@ -41,6 +41,12 @@ trait RandomSampler[T, U] extends Pseudorandom with 
Cloneable with Serializable
   /** take a random sample */
   def sample(items: Iterator[T]): Iterator[U]
 
+  /**
+   * Whether to sample the next item or not.
+   * Return how many times the next item will be sampled. Return 0 if it 
is not sampled.
+   */
+  def sample(): Int
+
--- End diff --

These changes to `RandomSampler` are submitted in #11578.
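
For context, a minimal sketch (toy types, not the actual `RandomSampler` trait or this PR's implementation) of how a per-item `sample(): Int` lets the consumer drive sampling row by row, the kind of call a generated loop can inline, instead of handing a whole iterator to `sample(items)`:

```scala
import scala.util.Random

object SampleApiSketch {
  trait ToySampler {
    // How many times the next item should be emitted; 0 means drop it.
    def sample(): Int
  }

  // Bernoulli-style sampling: each row is kept with probability `fraction`.
  class ToyBernoulliSampler(fraction: Double, rng: Random) extends ToySampler {
    override def sample(): Int = if (rng.nextDouble() < fraction) 1 else 0
  }

  def main(args: Array[String]): Unit = {
    val sampler = new ToyBernoulliSampler(fraction = 0.3, new Random(42))
    val rows = (1 to 10).iterator
    // The consuming loop decides per row whether, and how many times, to emit it.
    val sampled = rows.flatMap(row => Iterator.fill(sampler.sample())(row))
    println(sampled.toList) // roughly 30% of the rows, depending on the seed
  }
}
```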


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...

2016-03-08 Thread olarayej
Github user olarayej commented on the pull request:

https://github.com/apache/spark/pull/11318#issuecomment-194128322
  
@felixcheung @sun-rui @shivaram Can you folks please take a look at this 
one? Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11596#issuecomment-194126735
  
**[Test build #52725 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52725/consoleFull)**
 for PR 11596 at commit 
[`d7f0207`](https://github.com/apache/spark/commit/d7f020701b3b063beb00293af3ff4149211926ab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194126377
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194126378
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52721/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194126222
  
**[Test build #52721 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52721/consoleFull)**
 for PR 11599 at commit 
[`0fa21ac`](https://github.com/apache/spark/commit/0fa21acd2614fbdf128561b34c72088008d62a05).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...

2016-03-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10947#issuecomment-194125116
  
For bulk movement of code, can you comment in the pr yourself which part 
was actually changed, and which part was simply moving code from one place to 
another?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on the pull request:

https://github.com/apache/spark/pull/11486#issuecomment-194125311
  
From SparkR test failure:
```
1. Error: naiveBayes 
---
there is no package called 'mlbench'
1: withCallingHandlers(eval(code, new_test_environment), error = 
capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: data(HouseVotes84, package = "mlbench") at test_mllib.R:146
5: find.package(package, lib.loc, verbose = verbose)
6: stop(gettextf("there is no package called %s", sQuote(pkg)), domain = NA)
Error: Test failures
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11486#discussion_r55474419
  
--- Diff: R/pkg/R/mllib.R ---
@@ -192,3 +210,37 @@ setMethod("fitted", signature(object = 
"PipelineModel"),
   stop(paste("Unsupported model", modelName, sep = " "))
 }
   })
+
+#' Fit a naive Bayes model
+#'
+#' Fit a naive Bayes model, similarly to R's naiveBayes() except for 
omitting two arguments 'subset'
+#' and 'na.action'. Users can use 'subset' function and 'fillna' or 
'na.omit' function of DataFrame,
+#' respectviely, to preprocess their DataFrame. We use na.omit in this 
interface to avoid potential
+#' errors.
+#'
+#' @param object A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param data DataFrame for training
+#' @param lambda Smoothing parameter
+#' @param modelType Either 'multinomial' or 'bernoulli'. Default 
"multinomial".
+#' @param ... Undefined parameters
+#' @return A fitted naive Bayes model.
+#' @rdname naiveBayes
+#' @export
+#' @examples
+#'\dontrun{
+#' data(HouseVotes84, package = "mlbench")
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
+#' df <- createDataFrame(sqlContext, HouseVotes84)
+#' model <- glm(Class ~ ., df, lambda = 1, modelType = "multinomial")
--- End diff --

shouldn't this be `naiveBayes` instead of `glm`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11486#discussion_r55474361
  
--- Diff: R/pkg/R/generics.R ---
@@ -1168,3 +1168,7 @@ setGeneric("kmeans")
 #' @rdname fitted
 #' @export
 setGeneric("fitted")
+
+#' @rdname naiveBayes
+#' @export
+setGeneric("naiveBayes", function(formula, data, ...) { 
standardGeneric("naiveBayes") })
--- End diff --

could you check manually that the e1071 package's naiveBayes (http://ugrad.stat.ubc.ca/R/library/e1071/html/naiveBayes.html) still works when SparkR is loaded after e1071?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11486#discussion_r55474302
  
--- Diff: R/pkg/R/mllib.R ---
@@ -192,3 +210,37 @@ setMethod("fitted", signature(object = 
"PipelineModel"),
   stop(paste("Unsupported model", modelName, sep = " "))
 }
   })
+
+#' Fit a naive Bayes model
+#'
+#' Fit a naive Bayes model, similarly to R's naiveBayes() except for 
omitting two arguments 'subset'
+#' and 'na.action'. Users can use 'subset' function and 'fillna' or 
'na.omit' function of DataFrame,
+#' respectviely, to preprocess their DataFrame. We use na.omit in this 
interface to avoid potential
+#' errors.
+#'
+#' @param object A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param data DataFrame for training
+#' @param lambda Smoothing parameter
+#' @param modelType Either 'multinomial' or 'bernoulli'. Default 
"multinomial".
+#' @param ... Undefined parameters
--- End diff --

suggest removing `...` "param" from roxygen2 doc here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11486#discussion_r55474272
  
--- Diff: R/pkg/R/mllib.R ---
@@ -192,3 +210,37 @@ setMethod("fitted", signature(object = 
"PipelineModel"),
   stop(paste("Unsupported model", modelName, sep = " "))
 }
   })
+
+#' Fit a naive Bayes model
+#'
+#' Fit a naive Bayes model, similarly to R's naiveBayes() except for 
omitting two arguments 'subset'
+#' and 'na.action'. Users can use 'subset' function and 'fillna' or 
'na.omit' function of DataFrame,
+#' respectviely, to preprocess their DataFrame. We use na.omit in this 
interface to avoid potential
+#' errors.
--- End diff --

what is this potential error?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11486#discussion_r55474264
  
--- Diff: R/pkg/R/mllib.R ---
@@ -192,3 +210,37 @@ setMethod("fitted", signature(object = 
"PipelineModel"),
   stop(paste("Unsupported model", modelName, sep = " "))
 }
   })
+
+#' Fit a naive Bayes model
+#'
+#' Fit a naive Bayes model, similarly to R's naiveBayes() except for 
omitting two arguments 'subset'
+#' and 'na.action'. Users can use 'subset' function and 'fillna' or 
'na.omit' function of DataFrame,
+#' respectviely, to preprocess their DataFrame. We use na.omit in this 
interface to avoid potential
--- End diff --

typo: respectively


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/11220#discussion_r55474213
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-  signature(x = "DataFrame", value = "character"),
--- End diff --

I think in this case the base:: definition is implemented in such a way that it *should* (or at least it attempts to) handle different parameter types:

```
> getMethod("colnames<-")
Method Definition (Class "derivedDefaultMethod"):

function (x, value)
{
    if (is.data.frame(x)) {
        names(x) <- value
    }
    else {
        dn <- dimnames(x)
        if (is.null(dn)) {
            if (is.null(value))
                return(x)
            if ((nd <- length(dim(x))) < 2L)
                stop("attempt to set 'colnames' on an object with less than two dimensions")
            dn <- vector("list", nd)
        }
        if (length(dn) < 2L)
            stop("attempt to set 'colnames' on an object with less than two dimensions")
        if (is.null(value))
            dn[2L] <- list(NULL)
        else dn[[2L]] <- value
        dimnames(x) <- dn
    }
    x
}


Signatures:
        x     value
target  "ANY" "ANY"
defined "ANY" "ANY"
```

So from what I can see, the error you see actually comes from the base implementation, as "sort of" expected. Though I'm OK if we are to add explicit checks to make the error clearer for the user.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194122022
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52720/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194122020
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194121910
  
**[Test build #52720 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52720/consoleFull)**
 for PR 11555 at commit 
[`aa0a32b`](https://github.com/apache/spark/commit/aa0a32b30149620978fbdd26485f01982baa6531).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55473498
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -789,117 +752,193 @@ private[spark] class BlockManager(
   // lockNewBlockForWriting returned a read lock on the existing 
block, so we must free it:
   releaseLock(blockId)
 }
-return DoPutSucceeded
+return true
   }
 }
 
 val startTimeMs = System.currentTimeMillis
 
-// Size of the block in bytes
-var size = 0L
-
-// The level we actually use to put the block
-val putLevel = effectiveStorageLevel.getOrElse(level)
-
-// If we're storing bytes, then initiate the replication before 
storing them locally.
+// Since we're storing bytes, initiate the replication before storing 
them locally.
 // This is faster as data is already serialized and ready to send.
-val replicationFuture = data match {
-  case b: ByteBufferValues if putLevel.replication > 1 =>
-// Duplicate doesn't copy the bytes, but just creates a wrapper
-val bufferView = b.buffer.duplicate()
-Future {
-  // This is a blocking action and should run in 
futureExecutionContext which is a cached
-  // thread pool
-  replicate(blockId, bufferView, putLevel)
-}(futureExecutionContext)
-  case _ => null
+val replicationFuture = if (level.replication > 1) {
+  // Duplicate doesn't copy the bytes, but just creates a wrapper
+  val bufferView = bytes.duplicate()
+  Future {
+// This is a blocking action and should run in 
futureExecutionContext which is a cached
+// thread pool
+replicate(blockId, bufferView, level)
+  }(futureExecutionContext)
+} else {
+  null
 }
 
 var blockWasSuccessfullyStored = false
-var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None
-
-putBlockInfo.synchronized {
-  logTrace("Put for block %s took %s to get into synchronized block"
-.format(blockId, Utils.getUsedTimeMs(startTimeMs)))
 
-  try {
-if (putLevel.useMemory) {
-  // Put it in memory first, even if it also has useDisk set to 
true;
-  // We will drop it to disk later if the memory store can't hold 
it.
-  data match {
-case IteratorValues(iterator) =>
-  memoryStore.putIterator(blockId, iterator(), putLevel) match 
{
-case Right(s) =>
-  size = s
-case Left(iter) =>
-  iteratorFromFailedMemoryStorePut = Some(iter)
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  memoryStore.putBytes(blockId, bytes, putLevel)
-  }
-} else if (putLevel.useDisk) {
-  data match {
-case IteratorValues(iterator) =>
-  diskStore.putIterator(blockId, iterator(), putLevel) match {
-case Right(s) =>
-  size = s
-// putIterator() will never return Left (see its return 
type).
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  diskStore.putBytes(blockId, bytes, putLevel)
-  }
+bytes.rewind()
+val size = bytes.limit()
+
+try {
+  if (level.useMemory) {
+// Put it in memory first, even if it also has useDisk set to true;
+// We will drop it to disk later if the memory store can't hold it.
+val putSucceeded = if (level.deserialized) {
+  val values = dataDeserialize(blockId, bytes.duplicate())
+  memoryStore.putIterator(blockId, values, level).isRight
 } else {
-  assert(putLevel == StorageLevel.NONE)
-  throw new BlockException(
-blockId, s"Attempted to put block $blockId without specifying 
storage level!")
+  memoryStore.putBytes(blockId, size, () => bytes)
+}
+if (!putSucceeded && level.useDisk) {
+  logWarning(s"Persisting block $blockId to disk instead.")
+  diskStore.putBytes(blockId, bytes)
 }
+  } else if (level.useDisk) {
+diskStore.putBytes(blockId, bytes)
+  }
 
-val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
-blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
-if (blockWasSuccessfullyStored) {
-   

[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55473363
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -789,117 +752,193 @@ private[spark] class BlockManager(
   // lockNewBlockForWriting returned a read lock on the existing 
block, so we must free it:
   releaseLock(blockId)
 }
-return DoPutSucceeded
+return true
   }
 }
 
 val startTimeMs = System.currentTimeMillis
 
-// Size of the block in bytes
-var size = 0L
-
-// The level we actually use to put the block
-val putLevel = effectiveStorageLevel.getOrElse(level)
-
-// If we're storing bytes, then initiate the replication before 
storing them locally.
+// Since we're storing bytes, initiate the replication before storing 
them locally.
 // This is faster as data is already serialized and ready to send.
-val replicationFuture = data match {
-  case b: ByteBufferValues if putLevel.replication > 1 =>
-// Duplicate doesn't copy the bytes, but just creates a wrapper
-val bufferView = b.buffer.duplicate()
-Future {
-  // This is a blocking action and should run in 
futureExecutionContext which is a cached
-  // thread pool
-  replicate(blockId, bufferView, putLevel)
-}(futureExecutionContext)
-  case _ => null
+val replicationFuture = if (level.replication > 1) {
+  // Duplicate doesn't copy the bytes, but just creates a wrapper
+  val bufferView = bytes.duplicate()
+  Future {
+// This is a blocking action and should run in 
futureExecutionContext which is a cached
+// thread pool
+replicate(blockId, bufferView, level)
+  }(futureExecutionContext)
+} else {
+  null
 }
 
 var blockWasSuccessfullyStored = false
-var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None
-
-putBlockInfo.synchronized {
-  logTrace("Put for block %s took %s to get into synchronized block"
-.format(blockId, Utils.getUsedTimeMs(startTimeMs)))
 
-  try {
-if (putLevel.useMemory) {
-  // Put it in memory first, even if it also has useDisk set to 
true;
-  // We will drop it to disk later if the memory store can't hold 
it.
-  data match {
-case IteratorValues(iterator) =>
-  memoryStore.putIterator(blockId, iterator(), putLevel) match 
{
-case Right(s) =>
-  size = s
-case Left(iter) =>
-  iteratorFromFailedMemoryStorePut = Some(iter)
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  memoryStore.putBytes(blockId, bytes, putLevel)
-  }
-} else if (putLevel.useDisk) {
-  data match {
-case IteratorValues(iterator) =>
-  diskStore.putIterator(blockId, iterator(), putLevel) match {
-case Right(s) =>
-  size = s
-// putIterator() will never return Left (see its return 
type).
-  }
-case ByteBufferValues(bytes) =>
-  bytes.rewind()
-  size = bytes.limit()
-  diskStore.putBytes(blockId, bytes, putLevel)
-  }
+bytes.rewind()
+val size = bytes.limit()
+
+try {
+  if (level.useMemory) {
+// Put it in memory first, even if it also has useDisk set to true;
+// We will drop it to disk later if the memory store can't hold it.
+val putSucceeded = if (level.deserialized) {
+  val values = dataDeserialize(blockId, bytes.duplicate())
+  memoryStore.putIterator(blockId, values, level).isRight
 } else {
-  assert(putLevel == StorageLevel.NONE)
-  throw new BlockException(
-blockId, s"Attempted to put block $blockId without specifying 
storage level!")
+  memoryStore.putBytes(blockId, size, () => bytes)
+}
+if (!putSucceeded && level.useDisk) {
+  logWarning(s"Persisting block $blockId to disk instead.")
+  diskStore.putBytes(blockId, bytes)
 }
+  } else if (level.useDisk) {
+diskStore.putBytes(blockId, bytes)
+  }
 
-val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo)
-blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
-if (blockWasSuccessfullyStored) {
-   

[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11486#issuecomment-194117566
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11486#issuecomment-194117569
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52723/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11486#issuecomment-194117451
  
**[Test build #52723 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52723/consoleFull)**
 for PR 11486 at commit 
[`30e9c37`](https://github.com/apache/spark/commit/30e9c372207ed206a7dc294b5726ad008a18ed12).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55472870
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -17,112 +17,100 @@
 
 package org.apache.spark.storage
 
-import java.io.{File, FileOutputStream, IOException, RandomAccessFile}
+import java.io.{FileOutputStream, IOException, RandomAccessFile}
 import java.nio.ByteBuffer
 import java.nio.channels.FileChannel.MapMode
 
-import org.apache.spark.Logging
+import com.google.common.io.Closeables
+
+import org.apache.spark.{Logging, SparkConf}
 import org.apache.spark.util.Utils
 
 /**
  * Stores BlockManager blocks on disk.
  */
-private[spark] class DiskStore(blockManager: BlockManager, diskManager: 
DiskBlockManager)
-  extends BlockStore(blockManager) with Logging {
+private[spark] class DiskStore(conf: SparkConf, diskManager: 
DiskBlockManager) extends Logging {
 
-  val minMemoryMapBytes = 
blockManager.conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m")
+  private val minMemoryMapBytes = 
conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m")
 
-  override def getSize(blockId: BlockId): Long = {
+  def getSize(blockId: BlockId): Long = {
 diskManager.getFile(blockId.name).length
   }
 
-  override def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: 
StorageLevel): Unit = {
-// So that we do not modify the input offsets !
-// duplicate does not copy buffer, so inexpensive
-val bytes = _bytes.duplicate()
+  /**
+   * Invokes the provided callback function to write the specific block.
+   *
+   * @throws IllegalStateException if the block already exists in the disk 
store.
+   */
+  def put(blockId: BlockId)(writeFunc: FileOutputStream => Unit): Unit = {
+if (contains(blockId)) {
+  throw new IllegalStateException(s"Block $blockId is already present 
in the disk store")
+}
 logDebug(s"Attempting to put block $blockId")
 val startTime = System.currentTimeMillis
 val file = diskManager.getFile(blockId)
-val channel = new FileOutputStream(file).getChannel
-Utils.tryWithSafeFinally {
-  while (bytes.remaining > 0) {
-channel.write(bytes)
+val fileOutputStream = new FileOutputStream(file)
+var threwException: Boolean = true
+try {
+  writeFunc(fileOutputStream)
+  threwException = false
+} finally {
+  try {
+Closeables.close(fileOutputStream, threwException)
+  } finally {
+ if (threwException) {
+  remove(blockId)
+}
   }
-} {
-  channel.close()
 }
 val finishTime = System.currentTimeMillis
 logDebug("Block %s stored as %s file on disk in %d ms".format(
-  file.getName, Utils.bytesToString(bytes.limit), finishTime - 
startTime))
+  file.getName,
+  Utils.bytesToString(file.length()),
+  finishTime - startTime))
   }
 
-  override def putIterator(
-  blockId: BlockId,
-  values: Iterator[Any],
-  level: StorageLevel): Right[Iterator[Any], Long] = {
-logDebug(s"Attempting to write values for block $blockId")
-val startTime = System.currentTimeMillis
-val file = diskManager.getFile(blockId)
-val outputStream = new FileOutputStream(file)
-try {
+  def putBytes(blockId: BlockId, _bytes: ByteBuffer): Unit = {
--- End diff --

Please add a comment describing what this function does and what it guarantees about `_bytes` on return.
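Something along these lines, for example — only a sketch of the requested doc comment, reusing the `duplicate()` pattern the old code already had, not the actual patch:

```scala
  /**
   * Writes the contents of `_bytes` to the file backing `blockId`.
   *
   * Guarantee on `_bytes`: the buffer is duplicated before it is read, so the
   * caller's position, limit and mark are unchanged when this method returns.
   */
  def putBytes(blockId: BlockId, _bytes: ByteBuffer): Unit = {
    put(blockId) { fileOutputStream =>
      // duplicate() shares the underlying data but keeps an independent position
      val bytes = _bytes.duplicate()
      val channel = fileOutputStream.getChannel
      while (bytes.remaining > 0) {
        channel.write(bytes)
      }
    }
  }
```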


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11599#issuecomment-194115203
  
**[Test build #52724 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52724/consoleFull)**
 for PR 11599 at commit 
[`a31b1b5`](https://github.com/apache/spark/commit/a31b1b588949f2f92981f7d1a7d04d6e1806ccd1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194114531
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194114532
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52718/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11555#issuecomment-194113999
  
**[Test build #52718 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52718/consoleFull)**
 for PR 11555 at commit 
[`dab7a2f`](https://github.com/apache/spark/commit/dab7a2f1a5cc0438405b0fa1cf532ab883bed7e7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55472267
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
 ---
@@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest {
 comparePlans(Optimize.execute(query), expected)
   }
 
+  test("Eliminate the Project with an empty projectList") {
+val input = OneRowRelation
+val query =
+  Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: 
Nil, input)).analyze
--- End diff --

Let me add an empty List too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55472012
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
 ---
@@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest {
 comparePlans(Optimize.execute(query), expected)
   }
 
+  test("Eliminate the Project with an empty projectList") {
+val input = OneRowRelation
+val query =
+  Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: 
Nil, input)).analyze
--- End diff --

When running `Optimize.execute(query)`, the second Project's `projectList` 
is first pruned to empty. Then, the second (now empty) Project is removed.
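In other words (a sketch using the suite's `Optimize` and `comparePlans` helpers, not the literal test in the PR):

```scala
val input = OneRowRelation
val query = Project(Literal(1).as("1") :: Nil,
  Project(Literal(1).as("1") :: Nil, input)).analyze

// ColumnPruning first rewrites the inner Project to Project(Nil, input),
// because the outer projectList references none of its output; the new case
// then drops that empty Project entirely.
val expected = Project(Literal(1).as("1") :: Nil, input).analyze
comparePlans(Optimize.execute(query), expected)
```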


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13656][SQL] Remove spark.sql.parquet.ca...

2016-03-08 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/11576#issuecomment-194112721
  
Yes, ... Scanning all the files is pretty expensive, but we can add a 
metadata file to detect changes: when `DataFrameWriter` writes something to 
the files, it updates the metadata. Then, `ParquetRelation` checks the 
metadata to detect the updates.
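A very rough sketch of that idea — the marker file name and the `touch`/`isStale` helpers below are all hypothetical, just to make the flow concrete:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical marker-file protocol (not an existing Spark API): the writer
// bumps a timestamp file after every write, and the cached relation compares
// it with the timestamp it captured when the cache entry was built.
object CacheMarker {
  private val MarkerFile = "_cache_marker" // hypothetical file name

  /** Called by the writer after it finishes writing new files under `dir`. */
  def touch(fs: FileSystem, dir: Path): Unit = {
    val out = fs.create(new Path(dir, MarkerFile), true)
    try out.writeLong(System.currentTimeMillis()) finally out.close()
  }

  /** Timestamp of the last recorded write under `dir`. */
  def lastUpdated(fs: FileSystem, dir: Path): Long = {
    val in = fs.open(new Path(dir, MarkerFile))
    try in.readLong() finally in.close()
  }

  /** The cached relation is stale if the marker moved after the cache was built. */
  def isStale(fs: FileSystem, dir: Path, cachedAt: Long): Boolean =
    lastUpdated(fs, dir) > cachedAt
}
```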


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...

2016-03-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/11599#discussion_r55471810
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
 ---
@@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest {
 comparePlans(Optimize.execute(query), expected)
   }
 
+  test("Eliminate the Project with an empty projectList") {
+val input = OneRowRelation
+val query =
+  Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: 
Nil, input)).analyze
--- End diff --

Where do you test empty projectList?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/11592#issuecomment-194112069
  
This looks pretty good. Just some minor comments on naming. It'd be great 
to be more consistent among "branch", "case", and "switch".



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...

2016-03-08 Thread sun-rui
Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/11220#issuecomment-194111890
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...

2016-03-08 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/11534#discussion_r55471546
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -34,13 +35,15 @@ private case class MemoryEntry(value: Any, size: Long, 
deserialized: Boolean)
  * Stores blocks in memory, either as Arrays of deserialized Java objects 
or as
  * serialized ByteBuffers.
  */
-private[spark] class MemoryStore(blockManager: BlockManager, 
memoryManager: MemoryManager)
-  extends BlockStore(blockManager) {
+private[spark] class MemoryStore(
+conf: SparkConf,
+blockManager: BlockManager,
+memoryManager: MemoryManager)
+  extends Logging {
 
   // Note: all changes to memory allocations, notably putting blocks, 
evicting blocks, and
   // acquiring or releasing unroll memory, must be synchronized on 
`memoryManager`!
 
-  private val conf = blockManager.conf
   private val entries = new LinkedHashMap[BlockId, MemoryEntry](32, 0.75f, 
true)
--- End diff --

Hmm okay. DiskStore also needs to be thread-safe, right? It just happens that 
it doesn't maintain any interesting state right now. I think it's worth 
commenting, so that people modifying this know what the expected properties are.
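For example, a class-level note along these lines (a sketch of the comment being asked for, not part of the patch):

```scala
/**
 * Stores BlockManager blocks on disk.
 *
 * Thread-safety: instances may be used from multiple task threads concurrently.
 * The class currently keeps no interesting mutable state of its own, so no
 * synchronization is done here; any state added later must either stay
 * immutable or be guarded explicitly.
 */
private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging {
  // ... existing members unchanged ...
}
```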


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11592#discussion_r55471524
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -58,6 +58,27 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 }
   }
 
+  test("SPARK-13242: complicated case-when expressions") {
--- End diff --

SPARK-13242: case-when expression with large number of branches (or cases)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11592#discussion_r55471486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 ---
@@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, 
Expression)], elseValue: Option[E
 }
   }
 
+  def shouldCodegen: Boolean = {
+branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES
+  }
+
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
+if (!shouldCodegen) {
+  // Fallback to interpreted mode if there are too many branches, or 
it may reach the
+  // 64K limit (number of bytecode for single Java method).
+  return super.genCode(ctx, ev)
--- End diff --

This `super` call is slightly brittle. Can you see if there is a way in Scala to 
explicitly call `genCode` in `CodegenFallback`?
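One option, assuming `CodegenFallback` is mixed in directly in `CaseWhen`'s extends clause (a sketch, not the PR's code), is a qualified super call:

```scala
    if (!shouldCodegen) {
      // Pin the dispatch to the fallback trait instead of whatever `super`
      // happens to resolve to in the linearization.
      return super[CodegenFallback].genCode(ctx, ev)
    }
```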


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11592#discussion_r55471461
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 ---
@@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, 
Expression)], elseValue: Option[E
 }
   }
 
+  def shouldCodegen: Boolean = {
+branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES
+  }
+
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
+if (!shouldCodegen) {
+  // Fallback to interpreted mode if there are too many branches, or 
it may reach the
+  // 64K limit (number of bytecode for single Java method).
--- End diff --

number of bytecode for single Java method -> limit on bytecode size for a 
single function


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11592#discussion_r55471445
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 ---
@@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, 
Expression)], elseValue: Option[E
 }
   }
 
+  def shouldCodegen: Boolean = {
+branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES
+  }
+
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
+if (!shouldCodegen) {
+  // Fallback to interpreted mode if there are too many branches, or 
it may reach the
--- End diff --

or -> as


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR

2016-03-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11486#issuecomment-194111388
  
**[Test build #52723 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52723/consoleFull)**
 for PR 11486 at commit 
[`30e9c37`](https://github.com/apache/spark/commit/30e9c372207ed206a7dc294b5726ad008a18ed12).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...

2016-03-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11592#discussion_r55471424
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 ---
@@ -205,6 +214,9 @@ case class CaseWhen(branches: Seq[(Expression, 
Expression)], elseValue: Option[E
 /** Factory methods for CaseWhen. */
 object CaseWhen {
 
+  // The maxium number of switches supported with codegen.
+  val MAX_NUMBER_OF_SWITCHES = 20
--- End diff --

MAX_NUM_CASES_FOR_CODEGEN


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55471264
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala 
---
@@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
   "f1", "b[0].f1", "f1", "c[foo]", "d[0]"
 )
   }
+
+  test("SQL generator for explode in projection list") {
+// Basic Explode
+checkHiveQl("SELECT explode(array(1,2,3)) FROM src")
+
+// Explode with Alias
+checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src")
+
+// Explode without FROM
+checkHiveQl("select explode(array(1,2,3)) AS gencol")
+
+// non-generated columns in projection list
+checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as 
c3 FROM t1")
+  }
+
+  test("SQL generation for json_tuple as generator") {
+checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 
'f5') FROM parquet_t3")
+  }
+
+  test("SQL generation for lateral views") {
--- End diff --

Please add another test case for joining the lateral view with a regular 
table. Thanks! 

```LATERAL VIEW ..,  t1 as table1```  
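A hypothetical shape for such a test, using the suite's `checkHiveQl` helper — table and column names are made up, and the comma-join SQL below mirrors the suggested form without having been verified against the parser:

```scala
checkHiveQl(
  """
    |SELECT gencol, table1.key
    |FROM parquet_t3
    |LATERAL VIEW explode(value) gentab AS gencol, t1 AS table1
    |WHERE gencol = table1.value
  """.stripMargin
)
```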


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55471184
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala 
---
@@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
   "f1", "b[0].f1", "f1", "c[foo]", "d[0]"
 )
   }
+
+  test("SQL generator for explode in projection list") {
+// Basic Explode
+checkHiveQl("SELECT explode(array(1,2,3)) FROM src")
+
+// Explode with Alias
+checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src")
+
+// Explode without FROM
+checkHiveQl("select explode(array(1,2,3)) AS gencol")
+
+// non-generated columns in projection list
+checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as 
c3 FROM t1")
+  }
+
+  test("SQL generation for json_tuple as generator") {
+checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 
'f5') FROM parquet_t3")
+  }
+
+  test("SQL generation for lateral views") {
+// Filter and OUTER clause
+checkHiveQl(
+  """
+|SELECT key, value
+|FROM t1
+|LATERAL VIEW OUTER explode(value) gentab as gencol
+|WHERE key = 1
+  """.stripMargin
+)
+
+// single lateral view
+checkHiveQl(
+  """
+|SELECT *
+|FROM t1
+|LATERAL VIEW explode(array(1,2,3)) gentab AS gencol
+|SORT BY key ASC, gencol ASC LIMIT 1
+  """.stripMargin
+)
+
+// multiple lateral views
+checkHiveQl(
+  """
+|SELECT gentab2.*
+|FROM t1
+|LATERAL VIEW explode(array(array(1,2,3))) gentab1 AS gencol1
+|LATERAL VIEW explode(gentab1.gencol1) gentab2 AS gencol2 LIMIT 3
+  """.stripMargin
+)
+
+// No generated column aliases
+checkHiveQl(
+  """SELECT gentab.*
+|FROM t1
+|LATERAL VIEW explode(map('key1', 100, 'key2', 200)) gentab limit 2
+  """.stripMargin
+)
+  }
+
+  test("SQL generation for lateral views in subquery") {
+// Subquries in FROM clause using Generate
+checkHiveQl(
+  """
+|SELECT subq.gencol
+|FROM
+|(SELECT * from t1 LATERAL VIEW explode(value) gentab AS gencol) 
subq
+  """.stripMargin)
+
+checkHiveQl(
+  """
+|SELECT subq.key
+|FROM
+|(SELECT key, value from t1 LATERAL VIEW explode(value) gentab AS 
gencol) subq
+  """.stripMargin
+)
+  }
+
+  test("SQL generation for UDTF") {
--- End diff --

Please add more UDTF cases, as @cloud-fan suggested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...

2016-03-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11596#discussion_r55471043
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala ---
@@ -297,6 +303,60 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: 
SQLContext) extends Loggi
 )
   }
 
+  /* This function handles the SQL generation when generators.
--- End diff --

style issue: please use `/**` 
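i.e. keep the text, just switch the opener so the comment is picked up as ScalaDoc:

```scala
  /**
   * This function handles the SQL generation when generators ...
   */
```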


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


