[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73030/
Test PASSed.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #73030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73030/testReport)** for PR 16386 at commit [`58118f2`](https://github.com/apache/spark/commit/58118f2762a3abfb5ca82c156eabb9622844b764).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examp...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16968#discussion_r101683048
  
--- Diff: docs/ml-classification-regression.md ---
@@ -363,6 +363,51 @@ Refer to the [R API docs](api/R/spark.mlp.html) for more details.
 
 
 
+## Linear Support Vector Machine
+
+A [support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine) constructs a hyperplane
+or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
+regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has
+the largest distance to the nearest training-data point of any class (so-called functional margin),
+since in general the larger the margin the lower the generalization error of the classifier. LinearSVC
+in Spark ML supports binary calssification with linear SVM. Internally, it optimizes the
--- End diff --

calssification -> classification

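For context on the passage under review: a minimal Scala sketch of the `LinearSVC` usage this documentation section describes (assuming a running `SparkSession` named `spark`; the data path is the sample file shipped with Spark):

```scala
import org.apache.spark.ml.classification.LinearSVC

// Load a binary-labeled dataset in libsvm format.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

// Internally LinearSVC optimizes the hinge loss; regParam controls L2 regularization.
val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)

val model = lsvc.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
```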




[GitHub] spark pull request #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examp...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16968#discussion_r101683098
  
--- Diff: docs/ml-classification-regression.md ---
@@ -363,6 +363,51 @@ Refer to the [R API docs](api/R/spark.mlp.html) for more details.
 
 
 
+## Linear Support Vector Machine
+
+A [support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine) constructs a hyperplane
+or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
+regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has
+the largest distance to the nearest training-data point of any class (so-called functional margin),
--- End diff --

"largest distance" -> "longest distance"? I think?





[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16969#discussion_r101683604
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056)
 summary(model)
 ```
 
+ Linear Support Vector Machine (SVM) Classifier
+
+[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel.
+This is a binary classifier. Multi-class classification can be achieved by one-vs-the-rest strategy. We use a simple example to show how to use `spark.svmLinear`
+for binary classification.
+
+```{r}
+# load training data and create a DataFrame
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+# fit a Linear SVM classifier model
+model <- spark.svmLinear(training,  Survived ~ ., regParam = 0.01)
--- End diff --

should it go with `maxIter = 10` here too?





[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16969#discussion_r101683480
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056)
 summary(model)
 ```
 
+ Linear Support Vector Machine (SVM) Classifier
+
+[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel.
+This is a binary classifier. Multi-class classification can be achieved by one-vs-the-rest strategy. We use a simple example to show how to use `spark.svmLinear`
--- End diff --

we actually don't have support for the `one-vs-the-rest` strategy in R at the moment (an existing JIRA or design is still open), so perhaps it's best we don't reference that here

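For reference, the one-vs-rest strategy mentioned above does exist on the Scala side as the `OneVsRest` meta-estimator; a minimal sketch (assuming a `SparkSession` named `spark` and Spark's bundled multiclass sample data):

```scala
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

// OneVsRest trains one binary classifier per class and predicts the
// class whose classifier is most confident.
val binary = new LinearSVC().setMaxIter(10).setRegParam(0.1)
val ovr = new OneVsRest().setClassifier(binary)

val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val ovrModel = ovr.fit(data)
```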




[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16969#discussion_r101683657
  
--- Diff: examples/src/main/r/ml/svmLinear.R ---
@@ -0,0 +1,41 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/svmLinear.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-svmLinear-example")
+
+# $example on$
+# load training data
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+
+# fit linearSvc model
--- End diff --

`linearSvc` -> `svmLinear`? or `Linear SVM`?





[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16969#discussion_r101683410
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -564,6 +566,26 @@ model <- spark.logit(df, Species ~ ., regParam = 0.056)
 summary(model)
 ```
 
+ Linear Support Vector Machine (SVM) Classifier
+
+[Linear Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM) classifier is an SVM classifier with linear kernel.
--- End diff --

`linear kernels`?





[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16969#discussion_r101683369
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -471,6 +471,8 @@ SparkR supports the following machine learning models and algorithms.
 
 * Logistic Regression
 
+* Linear Support Vector Machine (SVM) Classifier
--- End diff --

shouldn't `Linear` go before `Logistic`?





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73029/
Test PASSed.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #73029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73029/testReport)** for PR 16386 at commit [`e323317`](https://github.com/apache/spark/commit/e32331794a63e8fcfe60a0901672e69dcfd6fe15).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...

2017-02-16 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16969
  
are we merging this after #16968?





[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-02-16 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16972#discussion_r101682505
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -73,17 +81,52 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e
   }
 
   def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = {
+    val bytesToStore = if (serializerManager.encryptionEnabled) {
+      try {
+        val data = bytes.toByteBuffer
+        val in = new ByteBufferInputStream(data, true)
+        val byteBufOut = new ByteBufferOutputStream(data.remaining())
+        val out = CryptoStreamUtils.createCryptoOutputStream(byteBufOut, conf,
+          serializerManager.encryptionKey.get)
+        try {
+          ByteStreams.copy(in, out)
+        } finally {
+          in.close()
+          out.close()
+        }
+        new ChunkedByteBuffer(byteBufOut.toByteBuffer)
+      } finally {
+        bytes.dispose()
+      }
+    } else {
+      bytes
+    }
+
     put(blockId) { fileOutputStream =>
       val channel = fileOutputStream.getChannel
       Utils.tryWithSafeFinally {
-        bytes.writeFully(channel)
+        bytesToStore.writeFully(channel)
      } {
        channel.close()
      }
    }
  }

  def getBytes(blockId: BlockId): ChunkedByteBuffer = {
+    val bytes = readBytes(blockId)
+
+    val in = serializerManager.wrapForEncryption(bytes.toInputStream(dispose = true))
+    new ChunkedByteBuffer(ByteBuffer.wrap(IOUtils.toByteArray(in)))
+  }
+
+  def getBytesAsValues[T](blockId: BlockId, classTag: ClassTag[T]): Iterator[T] = {
+    val bytes = readBytes(blockId)
+
+    serializerManager
+      .dataDeserializeStream(blockId, bytes.toInputStream(dispose = true))(classTag)
+  }
+
+  private[storage] def readBytes(blockId: BlockId): ChunkedByteBuffer = {
--- End diff --

abstract it for unit test

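The shape of the change above — route the plaintext buffer through an encrypting output stream before it reaches the disk channel — can be illustrated standalone. A minimal sketch using plain `javax.crypto` in place of Spark's private `CryptoStreamUtils` (key, cipher mode, and payload are illustrative only):

```scala
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream, KeyGenerator}

// Illustrative AES key; Spark derives its I/O encryption key elsewhere.
val key = KeyGenerator.getInstance("AES").generateKey()
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key)

// Copy plaintext through the encrypting stream, as putBytes does with
// CryptoStreamUtils.createCryptoOutputStream before writing to the channel.
val sink = new ByteArrayOutputStream()
val out = new CipherOutputStream(sink, cipher)
try out.write("block payload".getBytes("UTF-8")) finally out.close()

val encrypted = sink.toByteArray // this is what would hit the file
```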




[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-02-16 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16972#discussion_r101682451
  
--- Diff: core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala ---
@@ -39,27 +40,27 @@ class DiskStoreSuite extends SparkFunSuite {
     val blockId = BlockId("rdd_1_2")
     val diskBlockManager = new DiskBlockManager(new SparkConf(), deleteFilesOnStop = true)
 
-    val diskStoreMapped = new DiskStore(new SparkConf().set(confKey, "0"), diskBlockManager)
+    val conf = new SparkConf()
+    val serializer = new KryoSerializer(conf)
+    val serializerManager = new SerializerManager(serializer, conf)
+
+    conf.set(confKey, "0")
+    val diskStoreMapped = new DiskStore(conf, serializerManager, diskBlockManager)
     diskStoreMapped.putBytes(blockId, byteBuffer)
-    val mapped = diskStoreMapped.getBytes(blockId)
+    val mapped = diskStoreMapped.readBytes(blockId)
     assert(diskStoreMapped.remove(blockId))
 
-    val diskStoreNotMapped = new DiskStore(new SparkConf().set(confKey, "1m"), diskBlockManager)
+    conf.set(confKey, "1m")
+    val diskStoreNotMapped = new DiskStore(conf, serializerManager, diskBlockManager)
     diskStoreNotMapped.putBytes(blockId, byteBuffer)
-    val notMapped = diskStoreNotMapped.getBytes(blockId)
+    val notMapped = diskStoreNotMapped.readBytes(blockId)
 
     // Not possible to do isInstanceOf due to visibility of HeapByteBuffer
     assert(notMapped.getChunks().forall(_.getClass.getName.endsWith("HeapByteBuffer")),
       "Expected HeapByteBuffer for un-mapped read")
     assert(mapped.getChunks().forall(_.isInstanceOf[MappedByteBuffer]),
       "Expected MappedByteBuffer for mapped read")
 
-    def arrayFromByteBuffer(in: ByteBuffer): Array[Byte] = {
-      val array = new Array[Byte](in.remaining())
--- End diff --

remove unused





[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-02-16 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16972#discussion_r101682663
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala ---
@@ -21,17 +21,25 @@ import java.io.{FileOutputStream, IOException, RandomAccessFile}
 import java.nio.ByteBuffer
 import java.nio.channels.FileChannel.MapMode
 
-import com.google.common.io.Closeables
+import scala.reflect.ClassTag
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.commons.io.IOUtils
 
-import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
-import org.apache.spark.util.Utils
+import org.apache.spark.SparkConf
+import org.apache.spark.security.CryptoStreamUtils
+import org.apache.spark.serializer.SerializerManager
+import org.apache.spark.util.{ByteBufferInputStream, ByteBufferOutputStream, Utils}
 import org.apache.spark.util.io.ChunkedByteBuffer
 
 /**
  * Stores BlockManager blocks on disk.
  */
-private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging {
+private[spark] class DiskStore(
+    conf: SparkConf,
+    serializerManager: SerializerManager,
--- End diff --

add `serializerManager ` to do decryption work in `DiskStore`





[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-02-16 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16972#discussion_r101682730
  
--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -344,7 +370,7 @@ private[spark] class MemoryStore(
     val serializationStream: SerializationStream = {
       val autoPick = !blockId.isInstanceOf[StreamBlockId]
       val ser = serializerManager.getSerializer(classTag, autoPick).newInstance()
-      ser.serializeStream(serializerManager.wrapStream(blockId, redirectableStream))
+      ser.serializeStream(serializerManager.wrapForCompression(blockId, redirectableStream))
     }
--- End diff --

`MemoryStore` will not do encryption work

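To make the division of labor concrete — `MemoryStore` compresses but leaves encryption to the disk path — here is a compression-only wrap sketched with `java.util.zip` standing in for the private `SerializerManager.wrapForCompression` (Spark's actual default codec is LZ4; Deflater is used here only because it ships with the JDK):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}

// Serialized record bytes go through a compressing stream only;
// no encryption layer exists on the MemoryStore path.
val sink = new ByteArrayOutputStream()
val out = new DeflaterOutputStream(sink)
out.write("serialized record bytes".getBytes("UTF-8"))
out.close()

// Reading back inverts only the compression.
val in = new InflaterInputStream(new ByteArrayInputStream(sink.toByteArray))
val restored = scala.io.Source.fromInputStream(in, "UTF-8").mkString
```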




[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...

2017-02-16 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16972
  
@vanzin I will add some unit tests in the future. But could you please review this first? I think I may be missing something.





[GitHub] spark issue #16972: [SPARK-19556][CORE][WIP] Broadcast data is not encrypted...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16972
  
**[Test build #73036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73036/testReport)** for PR 16972 at commit [`f9a91d6`](https://github.com/apache/spark/commit/f9a91d63af3191b853ef88bd48293bcc19f3ec4c).





[GitHub] spark pull request #16972: [SPARK-19556][CORE][WIP] Broadcast data is not en...

2017-02-16 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/16972

[SPARK-19556][CORE][WIP] Broadcast data is not encrypted when I/O 
encryption is on

## What changes were proposed in this pull request?

`TorrentBroadcast` uses a couple of "back doors" into the block manager to 
write and read data.

The thing these block manager methods have in common is that they bypass 
the encryption code; so broadcast data is stored unencrypted in the block 
manager, causing unencrypted data to be written to disk if those blocks need to 
be evicted from memory.

The correct fix here is actually not to change `TorrentBroadcast`, but to 
fix the block manager so that:

- data stored in memory is not encrypted
- data written to disk is encrypted

This would simplify the code paths that use BlockManager / 
SerializerManager APIs.

## How was this patch tested?

Updated and added unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-19556

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16972.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16972


commit 63d909b4a0cc108fd8756436b21c65614abb9466
Author: uncleGen 
Date:   2017-02-15T14:12:47Z

cp

commit f9a91d63af3191b853ef88bd48293bcc19f3ec4c
Author: uncleGen 
Date:   2017-02-17T03:54:44Z

refactor blockmanager: data stored in memory is not encrypted, data written 
to disk is encrypted







[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101682138
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
       try {
         replicate(blockId, bytesToReplicate, level, remoteClassTag)
       } finally {
-        bytesToReplicate.dispose()
+        if (!level.useOffHeap) {
--- End diff --

As `StorageUtils.dispose` only cleans up a memory-mapped `ByteBuffer`, I 
don't think calling `bytesToReplicate.dispose()` here would be a problem.

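A small sketch of the distinction relied on here: only memory-mapped buffers hold OS resources worth releasing eagerly, while heap buffers are simply garbage-collected (standard NIO only; the cleaner call inside Spark's private `StorageUtils.dispose` is omitted, and the file path is illustrative):

```scala
import java.io.RandomAccessFile
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

val heap = ByteBuffer.allocate(1024) // heap buffer: nothing to dispose

val file = new RandomAccessFile("/tmp/example-block.bin", "rw")
try {
  file.setLength(1024)
  // A MappedByteBuffer pins a file mapping until its cleaner runs;
  // this is the case StorageUtils.dispose handles.
  val mapped = file.getChannel.map(MapMode.READ_WRITE, 0, 1024)
  println(s"heap direct=${heap.isDirect}, mapped direct=${mapped.isDirect}")
} finally {
  file.close()
}
```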




[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101680844
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
           false
       }
     } else {
-      memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+      val memoryMode = level.memoryMode
+      memoryStore.putBytes(blockId, size, memoryMode, () => {
+        if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

I did a search. Looks like the buffers passed to `putBytes` (the only 
caller of private `doPutBytes`) across Spark are all duplicated.





[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...

2017-02-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/16923#discussion_r101680633
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -106,21 +106,33 @@ private[hive] class HiveClientImpl(
 
     // Set up kerberos credentials for UserGroupInformation.loginUser within
     // current class loader
-    // Instead of using the spark conf of the current spark context, a new
-    // instance of SparkConf is needed for the original value of spark.yarn.keytab
-    // and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the
-    // keytab configuration for the link name in distributed cache
     if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) {
       val principalName = sparkConf.get("spark.yarn.principal")
-      val keytabFileName = sparkConf.get("spark.yarn.keytab")
-      if (!new File(keytabFileName).exists()) {
-        throw new SparkException(s"Keytab file: ${keytabFileName}" +
-          " specified in spark.yarn.keytab does not exist")
-      } else {
-        logInfo("Attempting to login to Kerberos" +
-          s" using principal: ${principalName} and keytab: ${keytabFileName}")
-        UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
+      val keytabFileName = {
+        val keytab = sparkConf.get("spark.yarn.keytab")
+        if (new File(keytab).exists()) {
+          keytab
+        } else {
+          // Instead of using the spark conf of the current spark context, a new
+          // instance of SparkConf is needed for the original value of spark.yarn.keytab
+          // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name
+          // in distributed cache, and this will make Spark driver fail to get correct keytab
+          // path in yarn client mode.
+          val originKeytab = new SparkConf().get("spark.yarn.keytab")
+          require(originKeytab != null,
+            "spark.yarn.keytab is not configured, this is unexpected")
+          if (new File(originKeytab).exists()) {
+            originKeytab
+          } else {
+            throw new SparkException(s"Keytab file: $originKeytab " +
+              s"specified in spark.yarn.keytab does not exist")
+          }
+        }
       }
+
+      logInfo("Attempting to login to Kerberos" +
+        s" using principal: ${principalName} and keytab: ${keytabFileName}")
+      UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
--- End diff --

Perhaps the reason is described in [here](https://github.com/apache/spark/pull/9272).





[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...

2017-02-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/16923#discussion_r101680479
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -106,21 +106,33 @@ private[hive] class HiveClientImpl(
 
     // Set up kerberos credentials for UserGroupInformation.loginUser within
     // current class loader
-    // Instead of using the spark conf of the current spark context, a new
-    // instance of SparkConf is needed for the original value of spark.yarn.keytab
-    // and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the
-    // keytab configuration for the link name in distributed cache
     if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) {
       val principalName = sparkConf.get("spark.yarn.principal")
-      val keytabFileName = sparkConf.get("spark.yarn.keytab")
-      if (!new File(keytabFileName).exists()) {
-        throw new SparkException(s"Keytab file: ${keytabFileName}" +
-          " specified in spark.yarn.keytab does not exist")
-      } else {
-        logInfo("Attempting to login to Kerberos" +
-          s" using principal: ${principalName} and keytab: ${keytabFileName}")
-        UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
+      val keytabFileName = {
+        val keytab = sparkConf.get("spark.yarn.keytab")
+        if (new File(keytab).exists()) {
+          keytab
+        } else {
+          // Instead of using the spark conf of the current spark context, a new
+          // instance of SparkConf is needed for the original value of spark.yarn.keytab
+          // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name
+          // in distributed cache, and this will make Spark driver fail to get correct keytab
+          // path in yarn client mode.
+          val originKeytab = new SparkConf().get("spark.yarn.keytab")
+          require(originKeytab != null,
+            "spark.yarn.keytab is not configured, this is unexpected")
+          if (new File(originKeytab).exists()) {
+            originKeytab
+          } else {
+            throw new SparkException(s"Keytab file: $originKeytab " +
+              s"specified in spark.yarn.keytab does not exist")
+          }
+        }
       }
+
+      logInfo("Attempting to login to Kerberos" +
+        s" using principal: ${principalName} and keytab: ${keytabFileName}")
+      UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
--- End diff --

I'm not entirely sure why logging in from the keytab is required here. Is this behavior required by Hive?





[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16785
  
**[Test build #73035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73035/testReport)** for PR 16785 at commit [`278c31c`](https://github.com/apache/spark/commit/278c31cf8aa27c71e0f5178bebcb426ec5fba6ce).





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16970
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73028/
Test PASSed.





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16970
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16970
  
**[Test build #73028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73028/testReport)** for PR 16970 at commit [`63a7f4c`](https://github.com/apache/spark/commit/63a7f4c62b2da32351d008f9719d513e14562e56).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Deduplication(`
   * `trait WatermarkSupport extends SparkPlan `
   * `case class DeduplicationExec(`





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16971
  
**[Test build #73033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73033/testReport)** for PR 16971 at commit [`d5e79a8`](https://github.com/apache/spark/commit/d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9).





[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16785
  
**[Test build #73034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73034/testReport)** for PR 16785 at commit [`4ba93fe`](https://github.com/apache/spark/commit/4ba93fecfcbdeebecc9526a90b5800c98b3f35a7).





[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...

2017-02-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16785
  
> this looks like a very big hammer to solve this problem. Can't we try a different approach?
> I think we should try to avoid optimizing already optimized code snippets, you might be able to do this using some kind of a fence. It would even be better if we would have a recursive node.

@cloud-fan @hvanhovell Ok. I've figured out how to add a check to reduce the candidates of aliased constraints. It achieves the same speed-up (cutting the benchmark running time in half) without the parallel-collection hammer.

Could you find time to look at it? Thanks.






[GitHub] spark issue #16851: [SPARK-19508][Core] Improve error message when binding s...

2017-02-16 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16851
  
gentle ping @srowen 





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-16 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/16971

[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile

## What changes were proposed in this pull request?
update `StatFunctions.multipleApproxQuantiles` to handle NaN/null


## How was this patch tested?
existing tests and added tests

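For context on the API being changed, a minimal sketch of calling `approxQuantile` from Scala on a column that contains a null and a NaN (assuming a `SparkSession` named `spark`; the column name and values are illustrative, and under the proposed behavior such entries are ignored rather than skewing the result):

```scala
import spark.implicits._

// A double column containing a null and a NaN alongside ordinary values.
val df = Seq(Some(1.0), Some(2.0), None, Some(Double.NaN), Some(4.0)).toDF("x")

// Approximate median with 1% relative error; this goes through
// StatFunctions.multipleApproxQuantiles under the hood.
val median = df.stat.approxQuantile("x", Array(0.5), 0.01)
println(median.mkString(", "))
```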
You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark quantiles_nan

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16971.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16971


commit d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9
Author: Zheng RuiFeng 
Date:   2017-02-17T03:18:43Z

create pr







[GitHub] spark pull request #16923: [SPARK-19038][Hive][YARN] Correctly figure out ke...

2017-02-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16923#discussion_r101678941
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -106,21 +106,33 @@ private[hive] class HiveClientImpl(
 
     // Set up kerberos credentials for UserGroupInformation.loginUser within
     // current class loader
-    // Instead of using the spark conf of the current spark context, a new
-    // instance of SparkConf is needed for the original value of spark.yarn.keytab
-    // and spark.yarn.principal set in SparkSubmit, as yarn.Client resets the
-    // keytab configuration for the link name in distributed cache
     if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) {
       val principalName = sparkConf.get("spark.yarn.principal")
-      val keytabFileName = sparkConf.get("spark.yarn.keytab")
-      if (!new File(keytabFileName).exists()) {
-        throw new SparkException(s"Keytab file: ${keytabFileName}" +
-          " specified in spark.yarn.keytab does not exist")
-      } else {
-        logInfo("Attempting to login to Kerberos" +
-          s" using principal: ${principalName} and keytab: ${keytabFileName}")
-        UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
+      val keytabFileName = {
+        val keytab = sparkConf.get("spark.yarn.keytab")
+        if (new File(keytab).exists()) {
+          keytab
+        } else {
+          // Instead of using the spark conf of the current spark context, a new
+          // instance of SparkConf is needed for the original value of spark.yarn.keytab
+          // set in SparkSubmit, as yarn.Client resets the keytab configuration for the link name
+          // in distributed cache, and this will make Spark driver fail to get correct keytab
+          // path in yarn client mode.
+          val originKeytab = new SparkConf().get("spark.yarn.keytab")
+          require(originKeytab != null,
+            "spark.yarn.keytab is not configured, this is unexpected")
+          if (new File(originKeytab).exists()) {
+            originKeytab
+          } else {
+            throw new SparkException(s"Keytab file: $originKeytab " +
+              s"specified in spark.yarn.keytab does not exist")
+          }
+        }
       }
+
+      logInfo("Attempting to login to Kerberos" +
+        s" using principal: ${principalName} and keytab: ${keytabFileName}")
+      UserGroupInformation.loginUserFromKeytab(principalName, keytabFileName)
--- End diff --

I probably should look at the history of this file, but I'm a little puzzled about why this login is necessary.

`SparkSubmit.scala` already logs in the user with the provided keytab, before YARN's `Client.scala` has a chance to mess with it. So it seems to me like this code is redundant?





[GitHub] spark pull request #16951: [SPARK-18285][SPARKR] SparkR approxQuantile suppo...

2017-02-16 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101676637
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed
 #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.
+#' Note that rows containing any NA values will be removed before calculation.
 #'
 #' @param x A SparkDataFrame.
-#' @param col The name of the numerical column.
+#' @param cols The names of the numerical columns.
--- End diff --

Done.

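The Scala counterpart of this `cols` change, for reference: `DataFrameStatFunctions` exposes a multi-column overload that returns one array of quantiles per input column (assuming a `SparkSession` named `spark`; names and values are illustrative):

```scala
import spark.implicits._

val df = Seq((165.0, 60.0), (180.0, 80.0), (172.0, 74.0)).toDF("height", "weight")

// One Array[Double] of quantiles per requested column.
val quantiles: Array[Array[Double]] =
  df.stat.approxQuantile(Array("height", "weight"), Array(0.25, 0.5, 0.75), 0.01)
```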




[GitHub] spark pull request #16951: [SPARK-18285][SPARKR] SparkR approxQuantile suppo...

2017-02-16 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101676617
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed
 #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.
+#' Note that rows containing any NA values will be removed before calculation.
--- End diff --

Added test cases.





[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-16 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/16949
  
Sure, this PR is fine, I'd just prefer some minor API adjustments to bring 
it closer to the code I linked above.





[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16951
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16951
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73031/
Test PASSed.





[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16951
  
**[Test build #73031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73031/testReport)** for PR 16951 at commit [`27b2cd6`](https://github.com/apache/spark/commit/27b2cd6aa3b27e44792cda4a23123b8ad3ef58aa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675669
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

So maybe use `putBlockStatus.storageLevel` instead?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675576
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Is it safe to store a ref to `bytes` if the memory is stored off-heap? If 
the caller changes the values in that memory or frees it, the buffer we put in 
the memory store will be affected. We don't want that kind of side-effect.

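A sketch of the aliasing concern raised here: `duplicate()` shares the underlying memory, so a true defensive copy is needed if the caller may mutate or free the original (standard NIO only; values are illustrative):

```scala
import java.nio.ByteBuffer

val original = ByteBuffer.allocateDirect(4)
original.putInt(42)
original.flip()

// duplicate() is a new view over the same storage: writes through one
// view are visible through the other.
val shared = original.duplicate()

// A defensive copy moves the bytes into independently owned memory.
val copy = ByteBuffer.allocate(original.remaining())
copy.put(original.duplicate())
copy.flip()
```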




[GitHub] spark issue #16964: [SPARK-19534][TESTS] Convert Java tests to use lambdas, ...

2017-02-16 Thread zzcclp
Github user zzcclp commented on the issue:

https://github.com/apache/spark/pull/16964
  
@srowen after updating to master, in the Eclipse IDE there is an error in JavaConsumerStrategySuite.java at line 52:

`final Map<TopicPartition, Long> offsets = new HashMap<>();
offsets.put(tp1, 23L);
final scala.collection.Map<TopicPartition, Object> sOffsets =
  JavaConverters.mapAsScalaMapConverter(offsets).asScala().mapValues(
    new scala.runtime.AbstractFunction1<Long, Object>() {
      @Override
      public Object apply(Long x) {
        return (Object) x;
      }
    }
  );`

The error is as follows:
**The method mapValues(Function1<Long, Object>) is ambiguous for the type Map<TopicPartition, Long>**

Is it related to the update from Java 7 to Java 8?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101664624
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

This condition can just check for `level.useOffHeap` right? But more 
generally, I had the same question that @viirya has. Is there a way to check if 
`bytes` is already off-heap and avoid the defensive copy? Or will that never be 
the case?
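
A hedged sketch of the check being asked about (assuming `getChunks` exposes
the buffer's underlying `java.nio.ByteBuffer`s):

// Copy only when the target level is off-heap and some chunk is still on-heap.
val alreadyOffHeap = bytes.getChunks().forall(_.isDirect)
val bytesToStore =
  if (level.useOffHeap && !alreadyOffHeap) {
    bytes.copy(Platform.allocateDirectBuffer)
  } else {
    bytes
  }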





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101674560
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Similarly, aren't we deciding whether to dispose `bytesToReplicate` based on
the 'target' `StorageLevel`? Shouldn't we make that decision based on whether
`bytesToReplicate` was stored off-heap or not?





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73027/
Test FAILed.





[GitHub] spark issue #16790: [SPARK-19450] Replace askWithRetry with askSync.

2017-02-16 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16790
  
https://github.com/apache/spark/pull/16690#discussion_r101616883 causes the
build to produce lots of deprecation warnings.
@srowen @vanzin What do you think about this?





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73027 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73027/testReport)**
 for PR 16826 at commit 
[`e2bbfa8`](https://github.com/apache/spark/commit/e2bbfa8c81f91c57f5628e771f42d414a1031d57).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16690: [SPARK-19347] ReceiverSupervisorImpl can add block to Re...

2017-02-16 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16690
  
@srowen 
What do you think about https://github.com/apache/spark/pull/16790?





[GitHub] spark pull request #16476: [SPARK-19084][SQL] Implement expression field

2017-02-16 Thread gczsjdy
Github user gczsjdy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16476#discussion_r101673768
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
 ---
@@ -340,3 +341,91 @@ object CaseKeyWhen {
 CaseWhen(cases, elseValue)
   }
 }
+
+/**
+ * A function that returns the index of str in (str1, str2, ...) list or 0 
if not found.
+ * It takes at least 2 parameters, and all parameters' types should be 
subtypes of AtomicType.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the 
str1,str2,... or 0 if not found.",
+  extended = """
+Examples:
+  > SELECT _FUNC_(10, 9, 3, 10, 4);
+   3
+  """)
+case class Field(children: Seq[Expression]) extends Expression {
+
+  override def nullable: Boolean = false
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  private lazy val ordering = 
TypeUtils.getInterpretedOrdering(children(0).dataType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.length <= 1) {
+  TypeCheckResult.TypeCheckFailure(s"FIELD requires at least 2 
arguments")
+} else if (!children.forall(_.dataType.isInstanceOf[AtomicType])) {
+  TypeCheckResult.TypeCheckFailure(s"FIELD requires all arguments to 
be of AtomicType")
+} else
+  TypeCheckResult.TypeCheckSuccess
+  }
+
+  override def dataType: DataType = IntegerType
+
+  override def eval(input: InternalRow): Any = {
+val target = children.head.eval(input)
+val targetDataType = children.head.dataType
+def findEqual(target: Any, params: Seq[Expression], index: Int): Int = 
{
+  params.toList match {
+case Nil => 0
+case head::tail if targetDataType == head.dataType
+  && head.eval(input) != null && ordering.equiv(target, 
head.eval(input)) => index
+case _ => findEqual(target, params.tail, index + 1)
+  }
+}
+if(target == null)
+  0
+else
+  findEqual(target, children.tail, 1)
+  }
+
+  protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+val evalChildren = children.map(_.genCode(ctx))
+val target = evalChildren(0)
+val targetDataType = children(0).dataType
+val rest = evalChildren.drop(1)
+val restDataType = children.drop(1).map(_.dataType)
+
+def updateEval(evalWithIndex: ((ExprCode, DataType), Int)): String = {
+  val ((eval, dataType), index) = evalWithIndex
+  s"""
+${eval.code}
+if (${dataType.equals(targetDataType)}
+  && ${ctx.genEqual(targetDataType, eval.value, target.value)}) {
+  ${ev.value} = ${index};
+}
+  """
+}
+
+def genIfElseStructure(code1: String, code2: String): String = {
--- End diff --

For now, I use `reduceRight`, which I think is a special case of `foldRight`.
If I understand your point about the floating `else` correctly (could you
please explain it a little more?), neither function can avoid it, because we
need an `else` nested inside each `else` block, like this:
`if (xxx)
else {
  if (xxx)
  else {
    ...
  }
}`
So if we avoid the floating `else` in `genIfElseStructure`, the `else` would
have to live in `updateEval`, which would make the code unclear and
complicated.
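
A small runnable sketch of that nesting (the branch snippets are placeholders):

val branches = Seq("if (c1) { r = 1; }", "if (c2) { r = 2; }", "if (c3) { r = 3; }")
// reduceRight folds from the right, so each later branch becomes the `else`
// body of the previous one: if ... else { if ... else { ... } }
val nested = branches.reduceRight { (code, rest) =>
  s"""$code
     |else {
     |  $rest
     |}""".stripMargin
}
println(nested)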





[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-16 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16949
  
@vanzin I opened a JIRA (https://issues.apache.org/jira/browse/SPARK-19642)
to research and address the potential security flaws. Do you mind if I
continue this PR?





[GitHub] spark issue #16951: [SPARK-18285][SPARKR] SparkR approxQuantile supports inp...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16951
  
**[Test build #73031 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73031/testReport)**
 for PR 16951 at commit 
[`27b2cd6`](https://github.com/apache/spark/commit/27b2cd6aa3b27e44792cda4a23123b8ad3ef58aa).





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #73032 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73032/testReport)**
 for PR 16386 at commit 
[`b801ab0`](https://github.com/apache/spark/commit/b801ab096c3bf426c7d2044291785359ace4922f).





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Build finished. Test FAILed.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73022/
Test FAILed.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73022 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73022/testReport)**
 for PR 16826 at commit 
[`f423f74`](https://github.com/apache/spark/commit/f423f7481348c021d9a27986064dfbe389c5de77).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #73030 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73030/testReport)**
 for PR 16386 at commit 
[`58118f2`](https://github.com/apache/spark/commit/58118f2762a3abfb5ca82c156eabb9622844b764).





[GitHub] spark issue #16968: [SPARK-19337] [ML] [Dcoc] Documentation and examples for...

2017-02-16 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/16968
  
I see. I will drop the R example here; whichever PR goes in later can
finish the document update.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #73029 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73029/testReport)**
 for PR 16386 at commit 
[`e323317`](https://github.com/apache/spark/commit/e32331794a63e8fcfe60a0901672e69dcfd6fe15).





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-16 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101671453
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1764,4 +1769,117 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
 assert(df2.schema == schema)
   }
+
+  test("SPARK-18352: Parse normal multi-line JSON files (compressed)") {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  primitiveFieldAndType
+.toDF("value")
+.write
+.option("compression", "GzIp")
+.text(path)
+
+  assert(new File(path).listFiles().exists(_.getName.endsWith(".gz")))
+
+  val jsonDF = spark.read.option("wholeFile", true).json(path)
--- End diff --

I rewrote these tests. Please take a look @gatorsmile and @cloud-fan.
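
For context, a hedged usage sketch of the option these tests exercise (the
path is illustrative):

// Parse each input file as one JSON document instead of one document per line.
val jsonDF = spark.read.option("wholeFile", true).json("/path/to/json/dir")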





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread NathanHowell
Github user NathanHowell commented on the issue:

https://github.com/apache/spark/pull/16386
  
@cloud-fan When implementing tests for the other modes I've uncovered an 
existing bug in schema inference in `DROPMALFORMED` mode: 
https://issues.apache.org/jira/browse/SPARK-19641. Since the bug is not
introduced by this set of patches, I will open a new pull request once this
one is merged.
You can inspect the fix here: 
https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae





[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16962
  
**[Test build #73023 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73023/testReport)**
 for PR 16962 at commit 
[`d35fac3`](https://github.com/apache/spark/commit/d35fac34247d6d6f2593b5284bbaba906a8dd033).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16962
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16962
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73023/
Test PASSed.





[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16970
  
**[Test build #73028 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73028/testReport)**
 for PR 16970 at commit 
[`63a7f4c`](https://github.com/apache/spark/commit/63a7f4c62b2da32351d008f9719d513e14562e56).





[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...

2017-02-16 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/16970

[SPARK-19497][SS]Implement streaming deduplication

## What changes were proposed in this pull request?

This PR adds a special streaming deduplication operator to support
`dropDuplicates` with aggregation and watermark. It reuses the
`dropDuplicates` API but creates a new logical plan, `Deduplication`, and a
new physical plan, `DeduplicationExec`.

The following cases are supported:

- dropDuplicates() (one `dropDuplicates` with any output mode)
- withWatermark(...).dropDuplicates().groupBy(...).outputMode("append") (sketched below)
- dropDuplicates().groupBy(...).outputMode("update")

Not supported cases:

- dropDuplicates().dropDuplicates() (multiple `dropDuplicates`s)
- groupBy(...).dropDuplicates() (`dropDuplicates` after `aggregation`)
- dropDuplicates().groupBy(...).outputMode("complete")
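
A hedged sketch of the second supported chain (the streaming source `events`
and the columns `eventTime` and `id` are assumed for illustration only):

import org.apache.spark.sql.functions.{col, window}

val deduped = events
  .withWatermark("eventTime", "10 minutes")  // lets old dedup state be evicted
  .dropDuplicates("id", "eventTime")         // the new streaming dedup operator
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

deduped.writeStream
  .outputMode("append")
  .format("console")
  .start()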

## How was this patch tested?

New unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16970.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16970


commit 63a7f4c62b2da32351d008f9719d513e14562e56
Author: Shixiong Zhu 
Date:   2017-02-15T19:01:57Z

Implement deduplication







[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...

2017-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16938
  
One more case:
5. `CREATE TABLE` or `CTAS` without the location spec: if the default path 
exists, should we succeed or fail?

After we finish the TABLE-level DDLs, we also need to do the same for
DATABASE-level and PARTITION-level DDLs.
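
For concreteness, a hedged illustration of case 5 (the table names are made up):

// Neither statement specifies LOCATION, so both resolve to the default
// warehouse path derived from the table name; the open question is what
// happens when that path already exists on disk.
spark.sql("CREATE TABLE t1 (a INT) USING parquet")           // plain CREATE TABLE
spark.sql("CREATE TABLE t2 USING parquet AS SELECT 1 AS a")  // CTAS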






[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
LGTM. @felixcheung are we good to merge?





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15125
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73019/
Test PASSed.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15125
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15125
  
**[Test build #73019 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73019/testReport)**
 for PR 15125 at commit 
[`f2efef6`](https://github.com/apache/spark/commit/f2efef6afe5692a70325bc788892f5b675755f08).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73027 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73027/testReport)**
 for PR 16826 at commit 
[`e2bbfa8`](https://github.com/apache/spark/commit/e2bbfa8c81f91c57f5628e771f42d414a1031d57).





[GitHub] spark issue #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecutionList...

2017-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16962
  
LGTM





[GitHub] spark pull request #16962: [SPARK-18120 ][SPARK-19557][SQL] Call QueryExecut...

2017-02-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16962#discussion_r101666288
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
+import org.apache.spark.sql.catalyst.plans.QueryPlan
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.command.RunnableCommand
+
+/**
+ * Saves the results of `query` in to a data source.
+ *
+ * Note that this command is different from 
[[InsertIntoDataSourceCommand]]. This command will call
+ * `CreatableRelationProvider.createRelation` to write out the data, while
+ * [[InsertIntoDataSourceCommand]] calls `InsertableRelation.insert`. 
Ideally these 2 data source
+ * interfaces should do the same thing, but as we've already published 
these 2 interfaces and the
+ * implementations may have different logic, we have to keep these 2 
different commands.
+ */
+case class SaveIntoDataSourceCommand(
+query: LogicalPlan,
+provider: String,
+partitionColumns: Seq[String],
+options: Map[String, String],
+mode: SaveMode) extends RunnableCommand {
+
+  override protected def innerChildren: Seq[QueryPlan[_]] = Seq(query)
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+DataSource(
+  sparkSession,
+  className = provider,
+  partitionColumns = partitionColumns,
+  options = options).write(mode, Dataset.ofRows(sparkSession, query))
+
--- End diff --

: )





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101666251
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala
 ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import scala.collection.mutable
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+
+class PowerIterationClusteringSuite extends SparkFunSuite
+  with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  @transient var data: Dataset[_] = _
+  final val r1 = 1.0
+  final val n1 = 10
+  final val r2 = 4.0
+  final val n2 = 40
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, 
n1, n2)
+  }
+
+  test("default parameters") {
+val pic = new PowerIterationClustering()
+
+assert(pic.getK === 2)
+assert(pic.getMaxIter === 20)
+assert(pic.getInitMode === "random")
+assert(pic.getFeaturesCol === "features")
+assert(pic.getPredictionCol === "prediction")
+assert(pic.getIdCol === "id")
+  }
+
+  test("set parameters") {
+val pic = new PowerIterationClustering()
+  .setK(9)
+  .setMaxIter(33)
+  .setInitMode("degree")
+  .setFeaturesCol("test_feature")
+  .setPredictionCol("test_prediction")
+  .setIdCol("test_id")
+
+assert(pic.getK === 9)
+assert(pic.getMaxIter === 33)
+assert(pic.getInitMode === "degree")
+assert(pic.getFeaturesCol === "test_feature")
+assert(pic.getPredictionCol === "test_prediction")
+assert(pic.getIdCol === "test_id")
+  }
+
+  test("parameters validation") {
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setInitMode("no_such_a_mode")
+}
+  }
+
+  test("power iteration clustering") {
--- End diff --

can you also add a test with a dataframe that has some extra data in it?
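
A hedged sketch of such a test (the `extra` column and the assertion are made
up for illustration):

import org.apache.spark.sql.functions.lit

test("power iteration clustering with extra columns") {
  // The transformer should use only the columns it declares and leave the
  // rest of the DataFrame alone.
  val dataWithExtra = data.toDF().withColumn("extra", lit(1.0))
  val pic = new PowerIterationClustering().setK(2).setMaxIter(10)
  val result = pic.transform(dataWithExtra)
  assert(result.columns.contains("prediction"))
}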





[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...

2017-02-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16611
  
Sure, I will rebase and update.





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101665899
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+@Since("2.2.0") override val uid: String)
+  extends Transformer with PowerIterationClusteringParams with 
DefaultParamsWritable {
+
+  setDefault(
+k -> 2,
+maxIter -> 20,
initMode -> "random")

[GitHub] spark issue #16923: [SPARK-19038][Hive][YARN] Correctly figure out keytab fi...

2017-02-16 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/16923
  
@vanzin, would you mind helping to review this PR? Thanks a lot.

IIUC the issue was introduced in #11510.





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101664268
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
--- End diff --

Instead of making an 'id' column, which does not convey much information, 
we should follow the example of `K-Means` and call it `prediction`. You already 
include the trait for that.





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101663790
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
--- End diff --

Instead of just validating the schema, we should validate and transform. 
You can follow the example in 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L92
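
A hedged sketch of that pattern applied here, mirroring the linked KMeans code:

protected def validateAndTransformSchema(schema: StructType): StructType = {
  SchemaUtils.checkColumnType(schema, $(idCol), LongType)
  // Append the output column instead of requiring it to pre-exist:
  // the prediction column is produced by transform(), not supplied as input.
  SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
}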





[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16969
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16969
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73025/
Test PASSed.





[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16969
  
**[Test build #73025 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73025/testReport)**
 for PR 16969 at commit 
[`6e54a4d`](https://github.com/apache/spark/commit/6e54a4d0c2f05cb017b3ce4ce105a723d03a9306).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73026 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73026/testReport)**
 for PR 16826 at commit 
[`2cee190`](https://github.com/apache/spark/commit/2cee190eb6c6902d39f68c25d928fbd5aaa522bc).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73026/
Test FAILed.




[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662332
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses the PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+@Since("2.2.0") override val uid: String)
--- End diff --

indentation


[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662298
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
--- End diff --

Not needed, per the comment above.




[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
--- End diff --

You do not need to write a separate function as you do below; after that 
change, it will allow more user-friendly error messages in the future.
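
A minimal sketch of what that could look like, assuming Spark ML's built-in 
`ParamValidators.inArray` helper supplies the check (this refactoring is my 
illustration, not part of the review):

```scala
import org.apache.spark.ml.param.{Param, ParamValidators}

// Sketch only: a standard validator replaces the hand-written check, so an
// invalid value produces the usual, user-friendly Param error message.
final val initMode = new Param[String](this, "initMode",
  "The initialization algorithm. Supported options: 'random' and 'degree'.",
  ParamValidators.inArray(Array("random", "degree")))
```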




[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662038
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
--- End diff --

same here




[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662018
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
--- End diff --

no need for brackets
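
That is, a single-class import can be written without braces (illustrative 
line only):

```scala
import org.apache.spark.ml.linalg.Vector
```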




[GitHub] spark issue #16826: [WIP][SPARK-19540][SQL] Add ability to clone SparkSessio...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73026 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73026/testReport)**
 for PR 16826 at commit 
[`2cee190`](https://github.com/apache/spark/commit/2cee190eb6c6902d39f68c25d928fbd5aaa522bc).




[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-02-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r101660601
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,569 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import java.sql.{Date, Timestamp}
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * @param catalystConf a configuration showing if CBO is enabled
+ */
+case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) 
extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @return Option[Statistics] When there is no statistics collected, it 
returns None.
+   */
+  def estimate: Option[Statistics] = {
+// We first copy child node's statistics and then modify it based on 
filter selectivity.
+val stats: Statistics = plan.child.stats(catalystConf)
+if (stats.rowCount.isEmpty) return None
+
+// save a mutable copy of colStats so that we can later change it 
recursively
+mutableColStats = mutable.Map(stats.attributeStats.map(kv => 
(kv._1.exprId, kv._2)).toSeq: _*)
+
+// estimate selectivity of this filter predicate
+val filterSelectivity: Double = calculateConditions(plan.condition)
+
+// attributeStats has mapping Attribute-to-ColumnStat.
+// mutableColStats has mapping ExprId-to-ColumnStat.
+// We use an ExprId-to-Attribute map to facilitate the mapping 
Attribute-to-ColumnStat
+val expridToAttrMap: Map[ExprId, Attribute] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+// copy mutableColStats contents to an immutable AttributeMap.
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+  mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+val filteredRowCount: BigInt =
+  EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * 
filterSelectivity)
+val filteredSizeInBytes: BigInt = EstimationUtils.ceil(BigDecimal(
+EstimationUtils.getOutputSize(plan.output, filteredRowCount, 
newColStats)
+))
+
+Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = 
Some(filteredRowCount),
+  attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter 
node.
+   * A compound condition is decomposed into multiple single conditions 
linked with AND, OR, NOT.
+   * 

[GitHub] spark issue #16967: [MINOR][PYTHON] Fix typo docstring: 'top' -> 'topic'

2017-02-16 Thread rolando
Github user rolando commented on the issue:

https://github.com/apache/spark/pull/16967
  
I've (rip)grep'ed over all `.py` and `.md` files in the repository 
searching for ` topi? ` (case-insensitive regex) and haven't found any other 
instances of this typo (writing `top` instead of `topic`).
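
For reference, a rough Scala sketch of that kind of scan (my reconstruction; 
the author used ripgrep, and the file set and regex here are only 
approximate):

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Walk the repo and flag lines matching " top " or " topi " (case-insensitive),
// i.e. places where "topic" may have been mistyped as "top".
val pattern = "(?i) topi? ".r
Files.walk(Paths.get(".")).iterator().asScala
  .filter(p => p.toString.endsWith(".py") || p.toString.endsWith(".md"))
  .foreach { p =>
    Files.readAllLines(p).asScala.zipWithIndex.foreach { case (line, i) =>
      if (pattern.findFirstIn(line).isDefined) println(s"$p:${i + 1}: $line")
    }
  }
```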




[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...

2017-02-16 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16938
  
OK, let's discuss it case by case:

1. `CREATE TABLE ... LOCATION path` works if the path exists; that's expected.
2. `CREATE TABLE ... LOCATION path` fails if the path doesn't exist; is that 
expected?
3. `CREATE TABLE ... LOCATION path AS SELECT ...`: shall we fail if the path 
exists?
4. `ALTER TABLE ... SET LOCATION path`: shall we fail if the path does not 
exist?




[GitHub] spark pull request #16395: [SPARK-17075][SQL] implemented filter estimation

2017-02-16 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16395#discussion_r101659663
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -0,0 +1,569 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import java.sql.{Date, Timestamp}
+
+import scala.collection.immutable.{HashSet, Map}
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * @param catalystConf a configuration showing if CBO is enabled
+ */
+case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) 
extends Logging {
+
+  /**
+   * We use a mutable colStats because we need to update the corresponding 
ColumnStat
+   * for a column after we apply a predicate condition.  For example, 
column c has
+   * [min, max] value as [0, 100].  In a range condition such as (c > 40 
AND c <= 50),
+   * we need to set the column's [min, max] value to [40, 100] after we 
evaluate the
+   * first condition c > 40.  We need to set the column's [min, max] value 
to [40, 50]
+   * after we evaluate the second condition c <= 50.
+   */
+  private var mutableColStats: mutable.Map[ExprId, ColumnStat] = 
mutable.Map.empty
+
+  /**
+   * Returns an option of Statistics for a Filter logical plan node.
+   * For a given compound expression condition, this method computes 
filter selectivity
+   * (or the percentage of rows meeting the filter condition), which
+   * is used to compute row count, size in bytes, and the updated 
statistics after a given
+   * predicate is applied.
+   *
+   * @return Option[Statistics] When there is no statistics collected, it 
returns None.
+   */
+  def estimate: Option[Statistics] = {
+// We first copy child node's statistics and then modify it based on 
filter selectivity.
+val stats: Statistics = plan.child.stats(catalystConf)
+if (stats.rowCount.isEmpty) return None
+
+// save a mutable copy of colStats so that we can later change it 
recursively
+mutableColStats = mutable.Map(stats.attributeStats.map(kv => 
(kv._1.exprId, kv._2)).toSeq: _*)
+
+// estimate selectivity of this filter predicate
+val filterSelectivity: Double = calculateConditions(plan.condition)
+
+// attributeStats has mapping Attribute-to-ColumnStat.
+// mutableColStats has mapping ExprId-to-ColumnStat.
+// We use an ExprId-to-Attribute map to facilitate the mapping 
Attribute-to-ColumnStat
+val expridToAttrMap: Map[ExprId, Attribute] =
+  stats.attributeStats.map(kv => (kv._1.exprId, kv._1))
+// copy mutableColStats contents to an immutable AttributeMap.
+val mutableAttributeStats: mutable.Map[Attribute, ColumnStat] =
+  mutableColStats.map(kv => expridToAttrMap(kv._1) -> kv._2)
+val newColStats = AttributeMap(mutableAttributeStats.toSeq)
+
+val filteredRowCount: BigInt =
+  EstimationUtils.ceil(BigDecimal(stats.rowCount.get) * 
filterSelectivity)
+val filteredSizeInBytes: BigInt = EstimationUtils.ceil(BigDecimal(
+EstimationUtils.getOutputSize(plan.output, filteredRowCount, 
newColStats)
+))
+
+Some(stats.copy(sizeInBytes = filteredSizeInBytes, rowCount = 
Some(filteredRowCount),
+  attributeStats = newColStats))
+  }
+
+  /**
+   * Returns a percentage of rows meeting a compound condition in Filter 
node.
+   * A compound condition is decomposed into multiple single conditions 
linked with AND, OR, NOT.
+   * For 

[GitHub] spark issue #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinear examp...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16969
  
**[Test build #73025 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73025/testReport)**
 for PR 16969 at commit 
[`6e54a4d`](https://github.com/apache/spark/commit/6e54a4d0c2f05cb017b3ce4ce105a723d03a9306).




[GitHub] spark issue #16967: [MINOR][PYTHON] Fix typo docstring: 'top' -> 'topic'

2017-02-16 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16967
  
Sure, it's worth a search for similar instances because sometimes typos 
spread via copy and paste. Could you make a pass over related code?



