[GitHub] spark issue #13461: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultPar...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13461
  
**[Test build #3067 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3067/consoleFull)**
 for PR 13461 at commit 
[`1b1bc93`](https://github.com/apache/spark/commit/1b1bc93f0d606d3a517a49b397957d99c35c4b99).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread NarineK
Github user NarineK commented on the issue:

https://github.com/apache/spark/pull/12836
  
Locally, run-tests.sh runs successfully, but it fails on Jenkins ... 





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Merged build finished. Test FAILed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60020/
Test FAILed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60020 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60020/consoleFull)**
 for PR 12836 at commit 
[`e4fa8e6`](https://github.com/apache/spark/commit/e4fa8e66896be19430ae4cfabef2669b5ecc4dd7).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13446: [SPARK-15704] [SQL] add a test case in DatasetAggregator...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13446
  
Sorry, I was interrupted by something and forgot about it...
Thanks, merging to master and 2.0!





[GitHub] spark pull request #9113: [SPARK-11100][SQL]HiveThriftServer HA issue,HiveTh...

2016-06-05 Thread xiaowangyu
Github user xiaowangyu closed the pull request at:

https://github.com/apache/spark/pull/9113





[GitHub] spark issue #9113: [SPARK-11100][SQL]HiveThriftServer HA issue,HiveThriftSer...

2016-06-05 Thread xiaowangyu
Github user xiaowangyu commented on the issue:

https://github.com/apache/spark/pull/9113
  
Thanks! I'll close it.





[GitHub] spark pull request #13516: [MLLIB][DOC] Edit logistic regression docs to pro...

2016-06-05 Thread goodsoldiersvejk
Github user goodsoldiersvejk closed the pull request at:

https://github.com/apache/spark/pull/13516





[GitHub] spark issue #13519: [SPARK-15771] [ML] [Examples] Use 'accuracy' rather than...

2016-06-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/13519
  
LGTM





[GitHub] spark issue #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13515
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13515
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60016/
Test PASSed.





[GitHub] spark issue #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13515
  
**[Test build #60016 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60016/consoleFull)**
 for PR 13515 at commit 
[`6de11a6`](https://github.com/apache/spark/commit/6de11a63e1f2a42ffaef9c4e24f1f448087f5b8f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13517: [SPARK-14839][SQL] Support for other types as opt...

2016-06-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/13517#discussion_r65835938
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -180,6 +180,9 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 :param path: string represents path to the JSON dataset,
  or RDD of Strings storing JSON objects.
 :param schema: an optional :class:`StructType` for the input 
schema.
+:param samplingRatio: sets the ratio for sampling and reading the 
input data to infer
--- End diff --

Ah, I see. It does not affect the actual I/O but just drops some records and then tries to infer the schema.
I will remove the change.

BTW, I have found another option, 
[`mergeSchema`](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L148-L152),
 in the Parquet data source, which I guess should be located in 
`ParquetOptions`. Could this maybe be done here together?
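
For reference, `mergeSchema` is set today as a per-read option on the Parquet source; a minimal sketch, with a hypothetical path:

```scala
// Minimal sketch: mergeSchema passed as a per-read Parquet option. With it
// enabled, Spark reconciles the schemas of all part files instead of
// picking a single file's schema. The path is a hypothetical placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mergeSchema-sketch").getOrCreate()

val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/events") // hypothetical path
df.printSchema()
```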





[GitHub] spark issue #13461: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultPar...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13461
  
**[Test build #3067 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3067/consoleFull)**
 for PR 13461 at commit 
[`1b1bc93`](https://github.com/apache/spark/commit/1b1bc93f0d606d3a517a49b397957d99c35c4b99).





[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13285
  
@GayathriMurali I think what is there for ```include_example``` is OK. 
Please see my other inline comments.





[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65835075
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,28 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/glm.html)
+* Naive Bayes [spark.naiveBayes()](api/R/naiveBayes.html)
+* KMeans [spark.kmeans()](api/R/kmeans.html)
+* AFT Survival Regression [spark.survreg()](api/R/survreg.html)
 
-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 
'coefficients' components. The 'devianceResiduals' gives the min/max deviance 
residuals of the estimation; the 'coefficients' gives the estimated 
coefficients and their estimated standard errors, t values and p-values. (It 
only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component 
which gives the estimated coefficients.
+Generalized Linear Regression can be used to train a model from a 
specified family. Currently the Gaussian, Binomial, Poisson and Gamma families 
are supported. We support a subset of the available R formula operators for 
model fitting, including '~', '.', ':', '+', and '-'.
 
-The examples below show the use of building gaussian GLM model and 
binomial GLM model using SparkR.
+The [summary()](api/R/summary.html) function gives the summary of a model 
produced by different algorithms listed above.
+This summary is same as the result of summary() function in R.
 
-## Gaussian GLM model
+## Model persistence
 
-
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
-
-# Fit a gaussian GLM model over the dataset.
-model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = 
"gaussian")
-
-# Model summary are returned in a similar format to R's native glm().
-summary(model)
-##$devianceResiduals
-## Min   Max 
-## -1.307112 1.412532
-##
-##$coefficients
-##   Estimate  Std. Error t value  Pr(>|t|)
-##(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
-##Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
-##Species_versicolor 1.458743  0.1121079  13.01195 0   
-##Species_virginica  1.946817  0.100015   19.46525 0   
-
-# Make predictions based on the model.
-predictions <- predict(model, newData = df)
-head(select(predictions, "Sepal_Length", "prediction"))
-##  Sepal_Length prediction
-##1  5.1   5.063856
-##2  4.9   4.662076
-##3  4.7   4.822788
-##4  4.6   4.742432
-##5  5.0   5.144212
-##6  5.4   5.385281
-{% endhighlight %}
-
+* write.ml allows users to save a fitted model in a given input path
+* read.ml allows users to read/load the model which was saved using 
write.ml in a given path
 
-## Binomial GLM model
+Model persistence is supported for all Machine Learning algorithms for all 
families.
 
-
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
-training <- filter(df, df$Species != "setosa")
-
-# Fit a binomial GLM model over the dataset.
-model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family 
= "binomial")
-
-# Model coefficients are returned in a similar format to R's native glm().
-summary(model)
-##$coefficients
-##   Estimate
-##(Intercept)  -13.046005
-##Sepal_Length   1.902373
-##Sepal_Width0.404655
-{% endhighlight %}
-
+The examples below show the use of building Gaussian GLM, NaiveBayes, 
kMeans and AFTSurvivalReg models using SparkR
--- End diff --

Furthermore, you should make these names consistent.



[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65835028
  
--- Diff: docs/sparkr.md ---
(same hunk as quoted in the previous comment)
--- End diff --

The examples only include glm with the gaussian family and glm with the binomial family.



[GitHub] spark pull request #13517: [SPARK-14839][SQL] Support for other types as opt...

2016-06-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13517#discussion_r65834648
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -180,6 +180,9 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 :param path: string represents path to the JSON dataset,
  or RDD of Strings storing JSON objects.
 :param schema: an optional :class:`StructType` for the input 
schema.
+:param samplingRatio: sets the ratio for sampling and reading the 
input data to infer
--- End diff --

it was actually intentional that samplingRatio was undocumented: regardless of 
the value, Spark still needs to read all the data, so it might as well be 1 all 
the time.
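
For illustration, a minimal sketch of how `samplingRatio` is passed as a JSON option (hypothetical path); per the comment above, it only controls how many records feed schema inference, not how much data is read:

```scala
// Minimal sketch: samplingRatio as a JSON data source option. It affects
// only the fraction of records used for schema inference; the input still
// has to be scanned. The path is a hypothetical placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("samplingRatio-sketch").getOrCreate()

val df = spark.read
  .option("samplingRatio", "0.1") // infer the schema from ~10% of records
  .json("/tmp/people.json")       // hypothetical path
df.printSchema()
```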






[GitHub] spark issue #9113: [SPARK-11100][SQL]HiveThriftServer HA issue,HiveThriftSer...

2016-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/9113
  
@viper-kun no - as I said, "I don't think anybody has thought a lot about 
it yet."





[GitHub] spark issue #13446: [SPARK-15704] [SQL] add a test case in DatasetAggregator...

2016-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13446
  
@cloud-fan next time please leave a message on the PR saying it was merged 
and the branches it was merged into.






[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60020 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60020/consoleFull)**
 for PR 12836 at commit 
[`e4fa8e6`](https://github.com/apache/spark/commit/e4fa8e66896be19430ae4cfabef2669b5ecc4dd7).





[GitHub] spark pull request #13446: [SPARK-15704] [SQL] add a test case in DatasetAgg...

2016-06-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13446





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread NarineK
Github user NarineK commented on the issue:

https://github.com/apache/spark/pull/12836
  
@shivaram, I didn't change the code, but I merged with master, because prior 
to this the build was failing since some pyspark tests didn't pass.

After today's merge, when I run the gapply test cases from RStudio everything 
passes, but if I run them from the command line using ./run-tests.sh, it fails 
on arrange ... 

I'm changing the test cases so that I call order after collecting the 
dataframe ... 
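
The fix described — collecting first, then imposing a deterministic order locally before asserting — looks roughly like this sketch (a Scala analog for illustration, not the actual SparkR test code):

```scala
// Illustrative analog of "call order after collecting": row order from a
// distributed computation is nondeterministic, so sort the collected
// result locally before comparing. Names and data are made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deterministic-test").getOrCreate()
import spark.implicits._

val ds = Seq(("b", 2), ("a", 1), ("c", 3)).toDS()

// Collect to the driver, then sort locally instead of asserting on the
// cluster-side ordering.
val actual = ds.collect().sortBy(_._1)
val expected = Array(("a", 1), ("b", 2), ("c", 3))
assert(actual.sameElements(expected))
```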





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13373
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13373
  
**[Test build #60018 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60018/consoleFull)**
 for PR 13373 at commit 
[`8b9b07d`](https://github.com/apache/spark/commit/8b9b07d8ced030563c2485fa3ac271cb69aa4ed0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13373
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60018/
Test FAILed.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13505
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60015/
Test FAILed.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13505
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13505
  
**[Test build #60015 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60015/consoleFull)**
 for PR 13505 at commit 
[`5504b6c`](https://github.com/apache/spark/commit/5504b6c2dd3ac7959b2cb7e139a54208368a9a45).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13505
  
**[Test build #60019 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60019/consoleFull)**
 for PR 13505 at commit 
[`5504b6c`](https://github.com/apache/spark/commit/5504b6c2dd3ac7959b2cb7e139a54208368a9a45).





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread JoshRosen
Github user JoshRosen commented on the issue:

https://github.com/apache/spark/pull/13505
  
Jenkins, retest this please.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13373
  
**[Test build #60018 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60018/consoleFull)**
 for PR 13373 at commit 
[`8b9b07d`](https://github.com/apache/spark/commit/8b9b07d8ced030563c2485fa3ac271cb69aa4ed0).





[GitHub] spark issue #13147: [SPARK-6320][SQL] Move planLater method into GenericStra...

2016-06-05 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/13147
  
@marmbrus Do you have any other thoughts on this?
If so, please let me know; otherwise, why don't we merge the minimal version, 
the same as the one for `branch-2.0`, into `master` for now?
I don't think keeping an API difference between `master` and `branch-2.0` for 
a long time is desirable.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13373
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13373
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60017/
Test FAILed.





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13373
  
**[Test build #60017 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60017/consoleFull)**
 for PR 13373 at commit 
[`dd6bdf0`](https://github.com/apache/spark/commit/dd6bdf05b1156b6e1471ceadc817c3f8a54270b2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class PushFilterIntoRelation(conf: SQLConf) extends 
Rule[LogicalPlan] with PredicateHelper `
  * `case class PushProjectIntoRelation(conf: SQLConf) extends 
Rule[LogicalPlan] `





[GitHub] spark issue #13373: [SPARK-15616] [SQL] Metastore relation should fallback t...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13373
  
**[Test build #60017 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60017/consoleFull)**
 for PR 13373 at commit 
[`dd6bdf0`](https://github.com/apache/spark/commit/dd6bdf05b1156b6e1471ceadc817c3f8a54270b2).





[GitHub] spark issue #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13515
  
**[Test build #60016 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60016/consoleFull)**
 for PR 13515 at commit 
[`6de11a6`](https://github.com/apache/spark/commit/6de11a63e1f2a42ffaef9c4e24f1f448087f5b8f).





[GitHub] spark pull request #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13515#discussion_r6583
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala
 ---
@@ -37,7 +37,7 @@ import org.apache.spark.sql.hive.test.{TestHive, 
TestHiveQueryExecution}
  * Allows the creations of tests that execute the same query against both 
hive
  * and catalyst, comparing the results.
  *
- * The "golden" results from Hive are cached in an retrieved both from the 
classpath and
+ * The "golden" results from Hive are cached in a retrieved both from the 
classpath and
--- End diff --

Thanks, I will fix this





[GitHub] spark issue #9113: [SPARK-11100][SQL]HiveThriftServer HA issue,HiveThriftSer...

2016-06-05 Thread viper-kun
Github user viper-kun commented on the issue:

https://github.com/apache/spark/pull/9113
  
@rxin Is there any design for the replacement?





[GitHub] spark pull request #9162: [SPARK-10655][SQL] Adding additional data type map...

2016-06-05 Thread sureshthalamati
Github user sureshthalamati commented on a diff in the pull request:

https://github.com/apache/spark/pull/9162#discussion_r65828274
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
 ---
@@ -47,19 +49,20 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationSuite {
 conn.prepareStatement("INSERT INTO tbl VALUES 
(17,'dave')").executeUpdate()
 
 conn.prepareStatement("CREATE TABLE numbers ( small SMALLINT, med 
INTEGER, big BIGINT, "
-  + "deci DECIMAL(31,20), flt FLOAT, dbl DOUBLE)").executeUpdate()
+  + "deci DECIMAL(31,20), flt FLOAT, dbl DOUBLE, real REAL, decflt 
DECFLOAT)").executeUpdate()
--- End diff --

Thanks for reviewing, @gatorsmile. Added test cases for those two 
variations of the DECFLOAT type.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/12836
  
The error was
```
1. Error: gapply() on a DataFrame 
--
java.lang.OutOfMemoryError: Java heap space
```

@NarineK Do you think there was any code change that could have caused this, 
or is this just flakiness in Jenkins?





[GitHub] spark pull request #13491: [SPARK-15748][SQL] Replace inefficient foldLeft()...

2016-06-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13491





[GitHub] spark issue #13491: [SPARK-15748][SQL] Replace inefficient foldLeft() call w...

2016-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13491
  
Merging in master/2.0.






[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Merged build finished. Test FAILed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60013/
Test FAILed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60013 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60013/consoleFull)**
 for PR 12836 at commit 
[`249568e`](https://github.com/apache/spark/commit/249568e2d244b3b81d53dfce797f8c021602749f).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13481: [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __...

2016-06-05 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/13481
  
That looks pretty good to me too, thanks @MLnick! I'll put that in soon.





[GitHub] spark pull request #13401: [SPARK-15657][SQL] RowEncoder should validate the...

2016-06-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13401





[GitHub] spark issue #13401: [SPARK-15657][SQL] RowEncoder should validate the data t...

2016-06-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/13401
  
LGTM, merging to master and branch-2.0.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13505
  
**[Test build #60015 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60015/consoleFull)**
 for PR 13505 at commit 
[`5504b6c`](https://github.com/apache/spark/commit/5504b6c2dd3ac7959b2cb7e139a54208368a9a45).





[GitHub] spark issue #13488: [MINOR][R][DOC] Fix R documentation generation instructi...

2016-06-05 Thread vectorijk
Github user vectorijk commented on the issue:

https://github.com/apache/spark/pull/13488
  
Thanks

On Sun, Jun 5, 2016, 13:05 asfgit wrote:

> Closed #13488 via 8a91105





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13505
  
retest this please





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13505
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60011/
Test FAILed.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13505
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13505
  
**[Test build #60011 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60011/consoleFull)**
 for PR 13505 at commit 
[`5504b6c`](https://github.com/apache/spark/commit/5504b6c2dd3ac7959b2cb7e139a54208368a9a45).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13520: [SPARK-15773][CORE][EXAMPLE] Avoid creating local variab...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13520
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60014/
Test PASSed.





[GitHub] spark issue #13520: [SPARK-15773][CORE][EXAMPLE] Avoid creating local variab...

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13520
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
@cloud-fan I am running into some trouble updating my branch to the latest 
master: I get errors in tests due to Analyzer.validateTopLevelTupleFields.

The issue seems to be that in KeyValueGroupedDataset[K, T] the Aggregators 
are supposed to operate on T, but the logicalPlan at this point already has K 
appended to T, because AppendColumns(func, inputPlan) is applied to the plan 
before it is passed into KeyValueGroupedDataset. As a result, 
validateTopLevelTupleFields also sees the key column in the inputs and 
believes the deserializer for T is missing a field.

Any suggestions on how to get around this?
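
A minimal sketch of the situation described above (hypothetical record 
type, spark-shell style; not code from this PR):
```scala
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

case class Rec(i: String, j: Long)
val ds: Dataset[Rec] = Seq(Rec("a", 1L), Rec("b", 2L)).toDS()

// groupByKey wraps the child plan in AppendColumns, appending the key
// column to Rec's columns before KeyValueGroupedDataset sees the plan,
// so a validator matching deserializer fields against the plan's output
// finds one column too many.
val grouped: KeyValueGroupedDataset[String, Rec] = ds.groupByKey(_.i)
```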





[GitHub] spark issue #13520: [SPARK-15773][CORE][EXAMPLE] Avoid creating local variab...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13520
  
**[Test build #60014 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60014/consoleFull)**
 for PR 13520 at commit 
[`0a5d82f`](https://github.com/apache/spark/commit/0a5d82fc8c1b3e0910231060090181e143e5215a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
@cloud-fan from the (added) unit tests:
```
val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j")
checkAnswer(df2.groupBy("i").agg(ComplexResultAgg.toColumn),
  Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil)
```
This shows how the underlying type is Row (with a schema consisting of 
Strings and Ints) and how it gets converted to the Aggregator's input type, 
(String, Long), so this involves both conversion and upcast.

and:
```
val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k")
checkAnswer(df3.groupBy("i").agg(ComplexResultAgg("i", "k")),
  Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil)
```
This is similar to the previous example, but I also select the columns I 
want the Aggregator to operate on (namely columns "i" and "k").
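
For reference, `ComplexResultAgg` comes from the PR's test code; a rough 
reconstruction against the stock 2.0 Aggregator API (my sketch, not the PR's 
exact definition — the PR additionally carries an input encoder, which is 
what lets it run on untyped DataFrame rows as above) might look like:
```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Computes (count, sum) of the Long field for each group.
object ComplexResultAgg extends Aggregator[(String, Long), (Long, Long), (Long, Long)] {
  def zero: (Long, Long) = (0L, 0L)
  def reduce(b: (Long, Long), a: (String, Long)): (Long, Long) =
    (b._1 + 1, b._2 + a._2)
  def merge(b1: (Long, Long), b2: (Long, Long)): (Long, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(r: (Long, Long)): (Long, Long) = r
  def bufferEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
  def outputEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
}
```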





[GitHub] spark issue #13520: [SPARK-15773][CORE][EXAMPLE] Avoid creating local variab...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13520
  
**[Test build #60014 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60014/consoleFull)**
 for PR 13520 at commit 
[`0a5d82f`](https://github.com/apache/spark/commit/0a5d82fc8c1b3e0910231060090181e143e5215a).





[GitHub] spark pull request #13520: [SPARK-15773][CORE][EXAMPLE] Avoid creating local...

2016-06-05 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/13520

[SPARK-15773][CORE][EXAMPLE] Avoid creating local variable `sc` in examples 
if possible

## What changes were proposed in this pull request?

Instead of using a local variable `sc` as in the following example, this PR 
uses `spark.sparkContext`. This makes the examples more concise and also fixes 
a misleading pattern, i.e., code that appears to create a SparkContext from a 
SparkSession.
```
-println("Creating SparkContext")
-val sc = spark.sparkContext
-
 println("Writing local file to DFS")
 val dfsFilename = dfsDirPath + "/dfs_read_write_test"
-val fileRDD = sc.parallelize(fileContents)
+val fileRDD = spark.sparkContext.parallelize(fileContents)
```

This will change 12 files (+30 lines, -52 lines).
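
A minimal, self-contained sketch of the promoted pattern (hypothetical app 
name and data, not taken from the PR):
```scala
import org.apache.spark.sql.SparkSession

object DfsReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DFSReadWriteSketch").getOrCreate()
    // No intermediate `val sc = spark.sparkContext` needed:
    val fileRDD = spark.sparkContext.parallelize(Seq("line one", "line two"))
    println(fileRDD.count())
    spark.stop()
  }
}
```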

## How was this patch tested?

Manual.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-15773

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13520


commit 0a5d82fc8c1b3e0910231060090181e143e5215a
Author: Dongjoon Hyun 
Date:   2016-06-05T21:42:42Z

[SPARK-15773][CORE][EXAMPLE] Avoid creating local variable `sc` in examples 
if possible







[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...

2016-06-05 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13513#discussion_r65825524
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -529,7 +529,28 @@ object SQLConf {
   .internal()
   .doc("How long in milliseconds a file is guaranteed to be visible 
for all readers.")
   .timeConf(TimeUnit.MILLISECONDS)
-  .createWithDefault(60 * 1000L) // 10 minutes
+  .createWithDefault(60 * 10 * 1000L) // 10 minutes
+
+  val FILE_SOURCE_LOG_DELETION = 
SQLConfigBuilder("spark.sql.streaming.fileSource.log.deletion")
+.internal()
+.doc("Whether to delete the expired log files in file stream source.")
+.booleanConf
+.createWithDefault(true)
+
+  val FILE_SOURCE_LOG_COMPACT_INTERVAL =
+SQLConfigBuilder("spark.sql.streaming.fileSource.log.compactInterval")
+  .internal()
+  .doc("Number of log files after which all the previous files " +
+"are compacted into the next log file.")
+  .intConf
+  .createWithDefault(10)
+
+  val FILE_SOURCE_LOG_CLEANUP_DELAY =
+SQLConfigBuilder("spark.sql.streaming.fileSource.log.cleanupDelay")
+  .internal()
+  .doc("How long in milliseconds a file is guaranteed to be visible 
for all readers.")
+  .timeConf(TimeUnit.MILLISECONDS)
+  .createWithDefault(60 * 10 * 1000L) // 10 minutes
--- End diff --

A nitpick, but I think it'd be easier to "decode" as `10 * 60 * 1000L`.
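
For example, a sketch using the `TimeUnit` import this file already has:
```scala
import java.util.concurrent.TimeUnit

// Reads directly as "10 minutes" instead of a product of magic numbers.
val cleanupDelayMs: Long = TimeUnit.MINUTES.toMillis(10) // 600000
```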





[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...

2016-06-05 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13513#discussion_r65825474
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -129,3 +131,86 @@ class FileStreamSource(
 
   override def toString: String = s"FileStreamSource[$qualifiedBasePath]"
 }
+
+class FileStreamSourceLog(sparkSession: SparkSession, path: String)
+  extends HDFSMetadataLog[Seq[String]](sparkSession, path) {
+
+  // Configurations about metadata compaction
+  private val compactInterval = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL)
+  require(compactInterval > 0,
+s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was 
$compactInterval) to a " +
+  s"positive value.")
+
+  private val fileCleanupDelayMs = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY)
+
+  private val isDeletingExpiredLog = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION)
+
+  private var compactBatchId: Long = -1L
+
+  private def isCompactionBatch(batchId: Long, compactInterval: Long): 
Boolean = {
+batchId % compactInterval == 0
+  }
+
+  override def add(batchId: Long, metadata: Seq[String]): Boolean = {
+if (isCompactionBatch(batchId, compactInterval)) {
+  compactMetadataLog(batchId - 1)
+}
+
+super.add(batchId, metadata)
+  }
+
+  private def compactMetadataLog(batchId: Long): Unit = {
+// read out compact metadata and merge with new metadata.
+val batches = super.get(Some(compactBatchId), Some(batchId))
+val totalMetadata = batches.flatMap(_._2)
+if (totalMetadata.isEmpty) {
+  return
+}
+
+// Remove old compact metadata file and rewrite.
+val renamedPath = new Path(path, 
s".${batchId.toString}-${UUID.randomUUID.toString}.tmp")
+fileManager.rename(batchIdToPath(batchId), renamedPath)
+
+var isSuccess = false
+try {
+  isSuccess = super.add(batchId, totalMetadata)
+} catch {
+  case NonFatal(e) => isSuccess = false
--- End diff --

Why are you setting `isSuccess` to `false` since it's `false` already?
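
A standalone sketch of the pattern under discussion, with hypothetical 
`doAdd`/`rollback` stand-ins — the assignment in the `catch` is indeed 
redundant, because `isSuccess` starts out `false`:
```scala
import scala.util.control.NonFatal

def doAdd(): Boolean = true // hypothetical stand-in for super.add(batchId, totalMetadata)
def rollback(): Unit = ()   // hypothetical stand-in for renaming the old file back

var isSuccess = false
try {
  isSuccess = doAdd()
} catch {
  case NonFatal(_) => // nothing needed here: isSuccess is still false
} finally {
  if (!isSuccess) rollback() // roll back to the previous state if compaction failed
}
```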





[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...

2016-06-05 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13513#discussion_r65825480
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -129,3 +131,86 @@ class FileStreamSource(
 
   override def toString: String = s"FileStreamSource[$qualifiedBasePath]"
 }
+
+class FileStreamSourceLog(sparkSession: SparkSession, path: String)
+  extends HDFSMetadataLog[Seq[String]](sparkSession, path) {
+
+  // Configurations about metadata compaction
+  private val compactInterval = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL)
+  require(compactInterval > 0,
+s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was 
$compactInterval) to a " +
+  s"positive value.")
+
+  private val fileCleanupDelayMs = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY)
+
+  private val isDeletingExpiredLog = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION)
+
+  private var compactBatchId: Long = -1L
+
+  private def isCompactionBatch(batchId: Long, compactInterval: Long): 
Boolean = {
+batchId % compactInterval == 0
+  }
+
+  override def add(batchId: Long, metadata: Seq[String]): Boolean = {
+if (isCompactionBatch(batchId, compactInterval)) {
+  compactMetadataLog(batchId - 1)
+}
+
+super.add(batchId, metadata)
+  }
+
+  private def compactMetadataLog(batchId: Long): Unit = {
+// read out compact metadata and merge with new metadata.
+val batches = super.get(Some(compactBatchId), Some(batchId))
+val totalMetadata = batches.flatMap(_._2)
+if (totalMetadata.isEmpty) {
+  return
+}
+
+// Remove old compact metadata file and rewrite.
+val renamedPath = new Path(path, 
s".${batchId.toString}-${UUID.randomUUID.toString}.tmp")
+fileManager.rename(batchIdToPath(batchId), renamedPath)
+
+var isSuccess = false
+try {
+  isSuccess = super.add(batchId, totalMetadata)
+} catch {
+  case NonFatal(e) => isSuccess = false
+} finally {
+  if (!isSuccess) {
+// Rollback to the previous status if compaction is failed.
--- End diff --

s/status/state ?





[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...

2016-06-05 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13513#discussion_r65825440
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -129,3 +131,86 @@ class FileStreamSource(
 
   override def toString: String = s"FileStreamSource[$qualifiedBasePath]"
 }
+
+class FileStreamSourceLog(sparkSession: SparkSession, path: String)
+  extends HDFSMetadataLog[Seq[String]](sparkSession, path) {
+
+  // Configurations about metadata compaction
+  private val compactInterval = 
sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL)
+  require(compactInterval > 0,
+s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was 
$compactInterval) to a " +
--- End diff --

I'd move `(was $compactInterval)` at the end of the message.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/12836
  
Yeah, I think we can still get this into 2.0 -- are there any other comments, 
@sun-rui? 
Also pinging @davies / @rxin again for a SQL reviewer to take a look at this.





[GitHub] spark issue #9113: [SPARK-11100][SQL]HiveThriftServer HA issue,HiveThriftSer...

2016-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/9113
  
We have currently inlined the Hive thrift server into the code base, but 
the long-term replacement is to be determined. I don't think anybody has 
thought a lot about it yet.

Do you mind closing this pull request for now? Thanks.






[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60013 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60013/consoleFull)**
 for PR 12836 at commit 
[`249568e`](https://github.com/apache/spark/commit/249568e2d244b3b81d53dfce797f8c021602749f).





[GitHub] spark pull request #13444: [SPARK-15530][SQL] Set #parallelism for file list...

2016-06-05 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13444#discussion_r65823862
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -409,13 +409,24 @@ private[sql] object HadoopFsRelation extends Logging {
   def listLeafFilesInParallel(
   paths: Seq[Path],
   hadoopConf: Configuration,
-  sparkContext: SparkContext): mutable.LinkedHashSet[FileStatus] = {
+  sparkSession: SparkSession): mutable.LinkedHashSet[FileStatus] = {
+assert(paths.size >= 
sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold)
 logInfo(s"Listing leaf files and directories in parallel under: 
${paths.mkString(", ")}")
 
+val sparkContext = sparkSession.sparkContext
+val sqlConf = sparkSession.sessionState.conf
 val serializableConfiguration = new 
SerializableConfiguration(hadoopConf)
 val serializedPaths = paths.map(_.toString)
 
-val fakeStatuses = 
sparkContext.parallelize(serializedPaths).mapPartitions { paths =>
+// Set the number of parallelism to prevent following file listing 
from generating many tasks
+// in case of large #defaultParallelism.
+val numParallelism = Math.min(
+  paths.size / Math.max(sqlConf.parallelPartitionDiscoveryThreshold, 
1) + 1,
+  sparkContext.defaultParallelism)
--- End diff --

I am not sure this `Math.min` can help if we have a small cluster (say, 
defaultParallelism is 4). I think in general, we need to create more tasks than 
`defaultParallelism` to help load balancing.
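
A quick numeric illustration of the concern, with hypothetical values:
```scala
val threshold = 32          // parallelPartitionDiscoveryThreshold (hypothetical)
val defaultParallelism = 4  // a small cluster
val numPaths = 1000
// The PR's formula caps the listing job at defaultParallelism tasks:
val numParallelism = math.min(numPaths / math.max(threshold, 1) + 1, defaultParallelism)
println(numParallelism)     // 4, even though ~32 tasks would balance better
```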





[GitHub] spark pull request #13444: [SPARK-15530][SQL] Set #parallelism for file list...

2016-06-05 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13444#discussion_r65823818
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala
 ---
@@ -75,7 +75,7 @@ class ListingFileCatalog(
 
   protected def listLeafFiles(paths: Seq[Path]): 
mutable.LinkedHashSet[FileStatus] = {
 if (paths.length >= 
sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
--- End diff --

Oh, this flag is used here.





[GitHub] spark issue #13295: [SPARK-15294][SPARKR][MINOR] Add pivot functionality to ...

2016-06-05 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/13295
  
@mhnatiuk It looks like the tests are failing with the error message I've 
pasted below. From reading the code, I think this is related to the `if` check 
we have in the pivot implementation: it should be `length(values) != 
length(unique(values))` instead of `==` as we have right now?

```
1. Error: pivot GroupedData column 
-
error in evaluating the argument 'x' in selecting a method for function 
'collect': 
  error in evaluating the argument 'x' in selecting a method for function 
'summarize': 
  Values in list are not unique
Calls: pivot -> pivot
```





[GitHub] spark pull request #13488: [MINOR][R][DOC] Fix R documentation generation in...

2016-06-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13488





[GitHub] spark issue #13488: [MINOR][R][DOC] Fix R documentation generation instructi...

2016-06-05 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/13488
  
Merging this to master and branch-2.0





[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan

2016-06-05 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/13508
  
Don't we need to do this for `isnan` too? 
https://github.com/apache/spark/blob/d642b273544bb77ef7f584326aa2d214649ac61b/R/pkg/R/functions.R#L651
It is different from `isNaN`.





[GitHub] spark pull request #13476: [SPARK-15684][SparkR]Not mask startsWith and ends...

2016-06-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13476#discussion_r65823398
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1137,6 +1137,13 @@ test_that("string operators", {
   expect_equal(count(where(df, like(df$name, "A%"))), 1)
   expect_equal(count(where(df, startsWith(df$name, "A"))), 1)
   expect_equal(first(select(df, substr(df$name, 1, 2)))[[1]], "Mi")
+  if (as.numeric(R.version$major) >= 3 && as.numeric(R.version$minor) >= 
3) {
--- End diff --

Actually, I think you should only put lines 1144 and 1145 into the version 
check. startsWith(character) and endsWith(character) are not in R 3.2.x; that's 
why the tests fail in Jenkins. The other tests with DataFrame/Column should 
work, though, and we should test for them.






[GitHub] spark pull request #13476: [SPARK-15684][SparkR]Not mask startsWith and ends...

2016-06-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13476#discussion_r65823343
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1137,6 +1137,13 @@ test_that("string operators", {
   expect_equal(count(where(df, like(df$name, "A%"))), 1)
   expect_equal(count(where(df, startsWith(df$name, "A"))), 1)
   expect_equal(first(select(df, substr(df$name, 1, 2)))[[1]], "Mi")
+  if (as.numeric(R.version$major) >= 3 && as.numeric(R.version$minor) >= 
3) {
--- End diff --

Do you know why this is needed? Jenkins is running with R 3.2.x, not R 
3.3.0, so this check has effectively disabled all the tests below.






[GitHub] spark pull request #13476: [SPARK-15684][SparkR]Not mask startsWith and ends...

2016-06-05 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13476#discussion_r65822915
  
--- Diff: R/pkg/R/generics.R ---
@@ -691,11 +691,11 @@ setGeneric("contains", function(x, ...) { 
standardGeneric("contains") })
 
 #' @rdname column
 #' @export
-setGeneric("desc", function(x) { standardGeneric("desc") })
+setGeneric("endsWith", function(x, suffix) { standardGeneric("endsWith") })
--- End diff --

Can you move this back into the existing order? It is sorted 
alphabetically, if I'm not wrong.





[GitHub] spark pull request #13476: [SPARK-15684][SparkR]Not mask startsWith and ends...

2016-06-05 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13476#discussion_r65822919
  
--- Diff: R/pkg/R/generics.R ---
@@ -723,11 +723,11 @@ setGeneric("like", function(x, ...) { 
standardGeneric("like") })
 
 #' @rdname column
 #' @export
-setGeneric("rlike", function(x, ...) { standardGeneric("rlike") })
+setGeneric("startsWith", function(x, prefix) { 
standardGeneric("startsWith") })
 
 #' @rdname column
 #' @export
-setGeneric("startsWith", function(x, ...) { standardGeneric("startsWith") 
})
+setGeneric("rlike", function(x, ...) { standardGeneric("rlike") })
--- End diff --

same as above





[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12313#discussion_r65822892
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -505,6 +506,117 @@ class Analyzer(
 }
   }
 
+  object ResolveOutputColumns extends Rule[LogicalPlan] {
+def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
+  case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, 
_, _)
+  if relation.resolved && !ins.resolved =>
+resolveOutputColumns(ins, expectedColumns(relation, partition), 
relation.toString)
+}
+
+private def resolveOutputColumns(
+insertInto: InsertIntoTable,
+columns: Seq[Attribute],
+relation: String) = {
+  val resolved = if (insertInto.isMatchByName) {
+projectAndCastOutputColumns(columns, insertInto.child, relation)
+  } else {
+castAndRenameOutputColumns(columns, insertInto.child, relation)
+  }
+
+  if (resolved == insertInto.child.output) {
+insertInto
+  } else {
+insertInto.copy(child = Project(resolved, insertInto.child))
+  }
+}
+
+/**
+ * Resolves output columns by input column name, adding casts if 
necessary.
+ */
+private def projectAndCastOutputColumns(
+output: Seq[Attribute],
+data: LogicalPlan,
+relation: String): Seq[NamedExpression] = {
+  output.map { col =>
+data.resolveQuoted(col.name, resolver) match {
+  case Some(inCol) if col.dataType != inCol.dataType =>
+Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
+  case Some(inCol) => inCol
+  case None =>
+throw new AnalysisException(
+  s"Cannot resolve ${col.name} in 
${data.output.mkString(",")}")
+}
+  }
+}
+
+private def castAndRenameOutputColumns(
+output: Seq[Attribute],
+data: LogicalPlan,
+relation: String): Seq[NamedExpression] = {
+  val outputNames = output.map(_.name)
+  // incoming expressions may not have names
+  val inputNames = data.output.flatMap(col => Option(col.name))
+  if (output.size > data.output.size) {
+// always a problem
+throw new AnalysisException(
+  s"""Not enough data columns to write into $relation:
+ |Data columns: ${data.output.mkString(",")}
+ |Table columns: ${outputNames.mkString(",")}""".stripMargin)
+  } else if (output.size < data.output.size) {
+if (outputNames.toSet.subsetOf(inputNames.toSet)) {
--- End diff --

Do we really need to distinguish these two cases? How about we just say 
that the number of columns doesn't match?





[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12313#discussion_r65822835
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -505,6 +506,117 @@ class Analyzer(
 }
   }
 
+  object ResolveOutputColumns extends Rule[LogicalPlan] {
+def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
+  case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, 
_, _)
+  if relation.resolved && !ins.resolved =>
+resolveOutputColumns(ins, expectedColumns(relation, partition), 
relation.toString)
+}
+
+private def resolveOutputColumns(
+insertInto: InsertIntoTable,
+columns: Seq[Attribute],
+relation: String) = {
+  val resolved = if (insertInto.isMatchByName) {
+projectAndCastOutputColumns(columns, insertInto.child, relation)
+  } else {
+castAndRenameOutputColumns(columns, insertInto.child, relation)
+  }
+
+  if (resolved == insertInto.child.output) {
+insertInto
+  } else {
+insertInto.copy(child = Project(resolved, insertInto.child))
+  }
+}
+
+/**
+ * Resolves output columns by input column name, adding casts if 
necessary.
+ */
+private def projectAndCastOutputColumns(
+output: Seq[Attribute],
+data: LogicalPlan,
+relation: String): Seq[NamedExpression] = {
+  output.map { col =>
+data.resolveQuoted(col.name, resolver) match {
+  case Some(inCol) if col.dataType != inCol.dataType =>
+Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
+  case Some(inCol) => inCol
+  case None =>
+throw new AnalysisException(
+  s"Cannot resolve ${col.name} in 
${data.output.mkString(",")}")
+}
+  }
+}
+
+private def castAndRenameOutputColumns(
+output: Seq[Attribute],
+data: LogicalPlan,
+relation: String): Seq[NamedExpression] = {
+  val outputNames = output.map(_.name)
+  // incoming expressions may not have names
+  val inputNames = data.output.flatMap(col => Option(col.name))
+  if (output.size > data.output.size) {
+// always a problem
+throw new AnalysisException(
+  s"""Not enough data columns to write into $relation:
+ |Data columns: ${data.output.mkString(",")}
+ |Table columns: ${outputNames.mkString(",")}""".stripMargin)
+  } else if (output.size < data.output.size) {
+if (outputNames.toSet.subsetOf(inputNames.toSet)) {
+  throw new AnalysisException(
+s"""Table column names are a subset of the input data columns:
+   |Data columns: ${inputNames.mkString(",")}
+   |Table columns: ${outputNames.mkString(",")}""".stripMargin)
+} else {
+  // be conservative and fail if there are too many columns
+  throw new AnalysisException(
+s"""Extra data columns to write into $relation:
+   |Data columns: ${data.output.mkString(",")}
+   |Table columns: ${outputNames.mkString(",")}""".stripMargin)
+}
+  } else {
+// check for reordered names and warn. this may be on purpose, so 
it isn't an error.
+if (outputNames.toSet == inputNames.toSet && outputNames != 
inputNames) {
+  logWarning(
+s"""Data column names match the table in a different order:
+   |Data columns: ${inputNames.mkString(",")}
+   |Table columns: ${outputNames.mkString(",")}""".stripMargin)
+}
+  }
+
+  data.output.zip(output).map {
+case (in, out) if !in.dataType.sameType(out.dataType) =>
+  Alias(Cast(in, out.dataType), out.name)()
+case (in, out) if in.name != out.name =>
+  Alias(in, out.name)()
+case (in, _) => in
+  }
+}
+
+private def expectedColumns(
+data: LogicalPlan,
+partitionData: Map[String, Option[String]]): Seq[Attribute] = {
+  data match {
+case partitioned: CatalogRelation =>
+  val tablePartitionNames = 
partitioned.catalogTable.partitionColumns.map(_.name)
+  val (inputPartCols, dataColumns) = data.output.partition { attr 
=>
+tablePartitionNames.contains(attr.name)
+  }
+  // Get the dynamic partition columns in partition order
+  val dynamicNames = tablePartitionNames.filter(
+name => partitionData.getOrElse(name, None).isEmpty)
--- 

[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12313#discussion_r65822816
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -505,6 +506,117 @@ class Analyzer(
 }
   }
 
+  object ResolveOutputColumns extends Rule[LogicalPlan] {
+def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
--- End diff --

should use `plan.resolveOperators`





[GitHub] spark issue #7898: [SPARK-9560][MLlib] add lda data generator

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/7898
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60012/
Test FAILed.





[GitHub] spark issue #7898: [SPARK-9560][MLlib] add lda data generator

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/7898
  
**[Test build #60012 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60012/consoleFull)**
 for PR 7898 at commit 
[`aff4d7f`](https://github.com/apache/spark/commit/aff4d7f683f01fc46e50333677e1634488fb45a1).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #7898: [SPARK-9560][MLlib] add lda data generator

2016-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/7898
  
Merged build finished. Test FAILed.





[GitHub] spark issue #7898: [SPARK-9560][MLlib] add lda data generator

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/7898
  
**[Test build #60012 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60012/consoleFull)**
 for PR 7898 at commit 
[`aff4d7f`](https://github.com/apache/spark/commit/aff4d7f683f01fc46e50333677e1634488fb45a1).





[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13512
  
Can you give some examples to show how this PR makes the aggregator API 
more friendly and easier to use?





[GitHub] spark pull request #13514: [SPARK-15770][ML] Annotation audit for Experiment...

2016-06-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13514





[GitHub] spark issue #13514: [SPARK-15770][ML] Annotation audit for Experimental and ...

2016-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13514
  
Merging in master/2.0. Thanks.






[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65822439
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,28 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/glm.html)
+* Naive Bayes [spark.naiveBayes()](api/R/naiveBayes.html)
+* KMeans [spark.kmeans()](api/R/kmeans.html)
+* AFT Survival Regression [spark.survreg()](api/R/survreg.html)
 
-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 
'coefficients' components. The 'devianceResiduals' gives the min/max deviance 
residuals of the estimation; the 'coefficients' gives the estimated 
coefficients and their estimated standard errors, t values and p-values. (It 
only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component 
which gives the estimated coefficients.
+Generalized Linear Regression can be used to train a model from a 
specified family. Currently the Gaussian, Binomial, Poisson and Gamma families 
are supported. We support a subset of the available R formula operators for 
model fitting, including '~', '.', ':', '+', and '-'.
 
-The examples below show the use of building gaussian GLM model and 
binomial GLM model using SparkR.
+The [summary()](api/R/summary.html) function gives the summary of a model 
produced by different algorithms listed above.
+This summary is same as the result of summary() function in R.
 
-## Gaussian GLM model
+## Model persistence
 
-
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
-
-# Fit a gaussian GLM model over the dataset.
-model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = 
"gaussian")
-
-# Model summary are returned in a similar format to R's native glm().
-summary(model)
-##$devianceResiduals
-## Min   Max 
-## -1.307112 1.412532
-##
-##$coefficients
-##   Estimate  Std. Error t value  Pr(>|t|)
-##(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
-##Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
-##Species_versicolor 1.458743  0.1121079  13.01195 0   
-##Species_virginica  1.946817  0.100015   19.46525 0   
-
-# Make predictions based on the model.
-predictions <- predict(model, newData = df)
-head(select(predictions, "Sepal_Length", "prediction"))
-##  Sepal_Length prediction
-##1  5.1   5.063856
-##2  4.9   4.662076
-##3  4.7   4.822788
-##4  4.6   4.742432
-##5  5.0   5.144212
-##6  5.4   5.385281
-{% endhighlight %}
-
+* write.ml allows users to save a fitted model in a given input path
--- End diff --

```[write.ml](api/R/write.ml.html)``` and ditto for ```read.ml```.





[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65822414
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,28 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/glm.html)
+* Naive Bayes [spark.naiveBayes()](api/R/naiveBayes.html)
+* KMeans [spark.kmeans()](api/R/kmeans.html)
+* AFT Survival Regression [spark.survreg()](api/R/survreg.html)
 
-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 
'coefficients' components. The 'devianceResiduals' gives the min/max deviance 
residuals of the estimation; the 'coefficients' gives the estimated 
coefficients and their estimated standard errors, t values and p-values. (It 
only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component 
which gives the estimated coefficients.
+Generalized Linear Regression can be used to train a model from a 
specified family. Currently the Gaussian, Binomial, Poisson and Gamma families 
are supported. We support a subset of the available R formula operators for 
model fitting, including '~', '.', ':', '+', and '-'.
 
-The examples below show the use of building gaussian GLM model and 
binomial GLM model using SparkR.
+The [summary()](api/R/summary.html) function gives the summary of a model 
produced by different algorithms listed above.
+This summary is same as the result of summary() function in R.
--- End diff --

```It produces a similar result to that of R's summary function.```





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13505
  
LGTM





[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65822355
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,28 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/glm.html)
+* Naive Bayes [spark.naiveBayes()](api/R/naiveBayes.html)
+* KMeans [spark.kmeans()](api/R/kmeans.html)
+* AFT Survival Regression [spark.survreg()](api/R/survreg.html)
 
-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 
'coefficients' components. The 'devianceResiduals' gives the min/max deviance 
residuals of the estimation; the 'coefficients' gives the estimated 
coefficients and their estimated standard errors, t values and p-values. (It 
only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component 
which gives the estimated coefficients.
+Generalized Linear Regression can be used to train a model from a 
specified family. Currently the Gaussian, Binomial, Poisson and Gamma families 
are supported. We support a subset of the available R formula operators for 
model fitting, including '~', '.', ':', '+', and '-'.
--- End diff --

```[Generalized Linear Regression Model](api/R/spark.glm.html) can be ...```





[GitHub] spark issue #1110: [SPARK-2174][MLLIB] treeReduce and treeAggregate

2016-06-05 Thread debasish83
Github user debasish83 commented on the issue:

https://github.com/apache/spark/pull/1110
  
@mengxr Say I have 20 nodes with 16 cores each: do you recommend running 
treeReduce with 320 partitions and OpenBLAS with numThreads=1 per partition 
for the seqOp, OR treeReduce with 20 partitions and OpenBLAS with 
numThreads=16 per partition for the seqOp? Do you have further improvement 
ideas for decreasing network shuffle using treeReduce/treeAggregate, or is 
there an open JIRA where we can move this discussion? It looks like shuffle 
is already compressed by default in Spark using Snappy... do you recommend 
compressing the vector logically as well?

SparkContext: 20 nodes, 16 cores, sc.defaultParallelism 320

def gramSize(n: Int) = n * (n + 1) / 2 // packed size of a symmetric n x n Gram matrix

val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
v1(i) += v2(i)
i += 1
  }
  v1
}

val n = gramSize(4096)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0))

Option 1: 320 partitions, 1 thread on combOp per partition

val start = System.nanoTime(); 
vv.treeReduce(combOp, 2); 
val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 5.639030243006

Option 2: 20 partitions, 1 thread on combOp per partition

val coalescedvv = vv.coalesce(20)
coalescedvv.count

val start = System.nanoTime(); 
coalescedvv.treeReduce(combOp, 2); 
val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 3.914068564004

Option 3: 20 partitions, OpenBLAS numThread=16 per partition

Setting up OpenBLAS on cluster, I will update soon.

Let me know your thoughts. I think if underlying operations are Dense BLAS 
level1, level2 or level3, running with higher OpenBLAS threads and reducing 
number of partitions should help in decreasing cross partition shuffle.
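
A related sketch, reusing `vv`, `combOp`, and `gramSize` from above: 
`treeAggregate` performs the same multi-level combining as `treeReduce` but 
takes an explicit zero value, so the first partition's array doesn't have to 
double as the accumulator. Since the buffer and element types coincide here 
(both Array[Float]), `combOp` serves as both seqOp and combOp:
```scala
val n2 = gramSize(4096)
val zero = Array.fill[Float](n2)(0f)
val start2 = System.nanoTime()
val summed = vv.treeAggregate(zero)(combOp, combOp, depth = 2)
val aggTime = (System.nanoTime() - start2) * 1e-9
```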





[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65822337
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,28 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [spark.glm()](api/R/glm.html)
--- End diff --

The correct API link should be ```api/R/spark.glm.html```, i.e. ```[spark.glm()](api/R/spark.glm.html)```; the same fix applies to the following three lines.





[GitHub] spark pull request #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-05 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65822186
  
--- Diff: examples/src/main/r/ml.R ---
@@ -25,8 +25,9 @@ library(SparkR)
 sc <- sparkR.init(appName="SparkR-ML-example")
 sqlContext <- sparkRSQL.init(sc)
 
- spark.glm and glm ##
 
+ spark.glm and glm ##
+# $example on$
--- End diff --

Move this line to L28; it's better to include the annotation line ```spark.glm and glm```, which helps users understand the corresponding code.





[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences

2016-06-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13505
  
**[Test build #60011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60011/consoleFull)** for PR 13505 at commit [`5504b6c`](https://github.com/apache/spark/commit/5504b6c2dd3ac7959b2cb7e139a54208368a9a45).





[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...

2016-06-05 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13505#discussion_r65821620
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +88,42 @@ package object expressions  {
   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    // It's possible that `attrs` is a linked list, which can lead to bad O(n^2) loops when
+    // accessing attributes by their ordinals. To avoid this performance penalty, convert the input
+    // to an array.
+    private lazy val attrsArray = attrs.toArray
+
+    private lazy val exprIdToOrdinal = {
+      val arr = attrsArray
+      val map = Maps.newHashMapWithExpectedSize[ExprId, Int](arr.length)
+      var index = 0
+      while (index < arr.length) {
+        val exprId = arr(index).exprId
+        if (!map.containsKey(exprId)) {
+          map.put(exprId, index)
--- End diff --

I was being conservative here to match the behavior of the old linear scan, which stopped at the first matching entry. However, we can remove the need for this check if we iterate over `arr` in reverse order.
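
Concretely, something like this (a sketch of the idea, not necessarily the final code):

private lazy val exprIdToOrdinal = {
  val arr = attrsArray
  val map = Maps.newHashMapWithExpectedSize[ExprId, Int](arr.length)
  // Iterate in reverse: for duplicate exprIds the smallest ordinal is
  // written last and therefore wins, preserving the old linear scan's
  // first-match semantics without a containsKey check per element.
  var index = arr.length - 1
  while (index >= 0) {
    map.put(arr(index).exprId, index)
    index -= 1
  }
  map
}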




