[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...

2017-03-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15628#discussion_r107312905
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -161,6 +162,110 @@ sealed trait Matrix extends Serializable {
*/
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toCSC: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toCSR: SparseMatrix = toSparseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = true)
--- End diff --

I'm debating whether we should keep the same layout ordering when we call 
`toSparse` or `toDense`.
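
For illustration, a hedged sketch of the layout-preserving alternative, expressed with the `toCSR`/`toCSC` methods this diff adds (the helper name is hypothetical, not part of the PR):

```scala
import org.apache.spark.ml.linalg.{Matrix, SparseMatrix}

// Hypothetical layout-preserving conversion: a Matrix with
// isTransposed == true stores its values row major, so convert it
// to CSR; otherwise keep column major and convert to CSC.
def toSparsePreservingLayout(m: Matrix): SparseMatrix =
  if (m.isTransposed) m.toCSR else m.toCSC
```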





[GitHub] spark issue #17381: [SPARK-20023] [SQL] Output table comment for DESC FORMAT...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17381
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17381: [SPARK-20023] [SQL] Output table comment for DESC FORMAT...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17381
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75009/
Test PASSed.





[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107315291
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -41,7 +41,20 @@ object CatalystSerde {
   }
 
   def generateObjAttr[T : Encoder]: Attribute = {
-AttributeReference("obj", encoderFor[T].deserializer.dataType, 
nullable = false)()
+val deserializer = encoderFor[T].deserializer
+val dataType = deserializer.dataType
+val nullable = if (deserializer.childrenResolved) {
--- End diff --

The point is whether `deserializer` is resolved, not whether `encoderFor[T]` 
is resolved.
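
A hedged sketch of the shape under discussion (the `else true` fallback for an unresolved deserializer is an assumption, not confirmed by this excerpt):

```scala
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.encoderFor
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}

def generateObjAttr[T : Encoder]: Attribute = {
  val deserializer = encoderFor[T].deserializer
  // nullable is only well-defined once the deserializer's children are
  // resolved; before that, conservatively assume the value can be null.
  val nullable =
    if (deserializer.childrenResolved) deserializer.nullable else true
  AttributeReference("obj", deserializer.dataType, nullable)()
}
```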





[GitHub] spark issue #17100: [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17100
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17100: [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17100
  
**[Test build #75013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75013/testReport)** for PR 17100 at commit [`7bb6f35`](https://github.com/apache/spark/commit/7bb6f35d9153ea522b8ccb4bcd44bf1c70cc64f0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17100: [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17100
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75013/
Test FAILed.





[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17348
  
@ueshin, you are right. I think we should consider the timezone.

```scala
val timestamp = java.sql.Timestamp.valueOf("2012-12-31 16:00:10.011")
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Seq(timestamp).toDF("a").groupBy("a").pivot("a").count()
```
```
+--------------------+-----------------------+
|                   a|2012-12-31 16:00:10.011|
+--------------------+-----------------------+
|2012-12-30 23:00:...|                      1|
+--------------------+-----------------------+
```





[GitHub] spark pull request #17348: [SPARK-20018][SQL] Pivot with timestamp and count...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17348#discussion_r107319052
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -486,14 +486,16 @@ class Analyzer(
   case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, 
child) =>
 val singleAgg = aggregates.size == 1
 def outputName(value: Literal, aggregate: Expression): String = {
+  val utf8val = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
--- End diff --

BTW, is this a correct way for handling timezone - @ueshin ?





[GitHub] spark issue #17335: [SPARK-19995][Hive][Yarn] Using real user to initialize ...

2017-03-21 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17335
  
Ping @vanzin, mind reviewing again? Thanks a lot.





[GitHub] spark pull request #17348: [SPARK-20018][SQL] Pivot with timestamp and count...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17348#discussion_r107319641
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -486,14 +486,16 @@ class Analyzer(
   case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, 
child) =>
 val singleAgg = aggregates.size == 1
 def outputName(value: Literal, aggregate: Expression): String = {
+  val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
--- End diff --

It seems we can cast into `StringType` from any type - 
https://github.com/apache/spark/blob/e9e2c612d58a19ddcb4b6abfb7389a4b0f7ef6f8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L41
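
For reference, a hedged sketch of how such an eagerly evaluated cast could feed the output column name, using the names from the diff above (the null handling via `Option` is an assumption, not taken from the PR):

```scala
// Render the pivot literal as a string in the session-local timezone.
// Cast.eval returns a UTF8String or null, so guard before calling toString.
val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
val columnName = Option(utf8Value).map(_.toString).getOrElse("null")
```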






[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17348
  
**[Test build #75019 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75019/testReport)** for PR 17348 at commit [`4e4cfa7`](https://github.com/apache/spark/commit/4e4cfa76727cc213f71d2f9b5bf2bf7c5905c54e).





[GitHub] spark issue #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - recover f...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17382
  
**[Test build #75014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75014/testReport)** for PR 17382 at commit [`8c0fcb1`](https://github.com/apache/spark/commit/8c0fcb1f5406e76908ee045d4530d09e5da1c017).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17350
  
Thanks!





[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107321520
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -41,7 +41,20 @@ object CatalystSerde {
   }
 
   def generateObjAttr[T : Encoder]: Attribute = {
-AttributeReference("obj", encoderFor[T].deserializer.dataType, 
nullable = false)()
+val deserializer = encoderFor[T].deserializer
+val dataType = deserializer.dataType
+val nullable = if (deserializer.childrenResolved) {
--- End diff --

`deserializer` is resolved at 
[here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L251-L260).





[GitHub] spark issue #17378: [SPARK-20046][SQL] Facilitate loop optimizations in a JI...

2017-03-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/17378
  
cc @hvanhovell 





[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r107322905
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -219,18 +219,22 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)
 case None =>
   logInfo("Started reading broadcast variable " + id)
   val startTimeMs = System.currentTimeMillis()
-  val blocks = readBlocks().flatMap(_.getChunks())
+  val blocks = readBlocks()
   logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs))
 
-  val obj = TorrentBroadcast.unBlockifyObject[T](
-blocks, SparkEnv.get.serializer, compressionCodec)
-  // Store the merged copy in BlockManager so other tasks on this executor don't
-  // need to re-fetch it.
-  val storageLevel = StorageLevel.MEMORY_AND_DISK
-  if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) {
-throw new SparkException(s"Failed to store $broadcastId in BlockManager")
+  try {
+val obj = TorrentBroadcast.unBlockifyObject[T](
+  blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec)
+// Store the merged copy in BlockManager so other tasks on this executor don't
+// need to re-fetch it.
+val storageLevel = StorageLevel.MEMORY_AND_DISK
+if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) {
+  throw new SparkException(s"Failed to store $broadcastId in BlockManager")
+}
+obj
+  } finally {
+blocks.foreach(_.dispose())
--- End diff --

ah good catch! we should dispose the blocks here





[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r107323246
  
--- Diff: 
core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala ---
@@ -63,12 +84,27 @@ private[spark] object CryptoStreamUtils extends Logging {
   is: InputStream,
   sparkConf: SparkConf,
   key: Array[Byte]): InputStream = {
-val properties = toCryptoConf(sparkConf)
 val iv = new Array[Byte](IV_LENGTH_IN_BYTES)
-is.read(iv, 0, iv.length)
-val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION)
-new CryptoInputStream(transformationStr, properties, is,
-  new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
+ByteStreams.readFully(is, iv)
+val params = new CryptoParams(key, sparkConf)
+new CryptoInputStream(params.transformation, params.conf, is, params.keySpec,
+  new IvParameterSpec(iv))
+  }
+
+  /**
+   * Wrap a `ReadableByteChannel` for decryption.
+   */
+  def createReadableChannel(
+  channel: ReadableByteChannel,
+  sparkConf: SparkConf,
+  key: Array[Byte]): ReadableByteChannel = {
+val iv = new Array[Byte](IV_LENGTH_IN_BYTES)
+val buf = ByteBuffer.wrap(iv)
+JavaUtils.readFully(channel, buf)
--- End diff --

Why not use `ByteStreams.readFully`? The `buf` is not used elsewhere.
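
For comparison: Guava's `ByteStreams.readFully` fills a byte array from an `InputStream`, while the source here is a `ReadableByteChannel`, where a single `read` may transfer fewer bytes than requested. A hedged sketch of what a fully-reading channel helper has to do:

```scala
import java.io.EOFException
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Keep reading until the buffer is full; a channel read may return early,
// and -1 signals end-of-stream before the IV was fully read.
def readFully(channel: ReadableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    if (channel.read(buf) == -1) {
      throw new EOFException("Channel ended before buffer was filled")
    }
  }
}
```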





[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107323253
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameImplicitsSuite.scala ---
@@ -51,4 +51,15 @@ class DataFrameImplicitsSuite extends QueryTest with 
SharedSQLContext {
   sparkContext.parallelize(1 to 10).map(_.toString).toDF("stringCol"),
   (1 to 10).map(i => Row(i.toString)))
   }
+
+  test("SPARK-19959: df[java.lang.Long].collect includes null throws 
NullPointerException") {
+val dfInt = sparkContext.parallelize(Seq[java.lang.Integer](0, null, 
2), 1).toDF
+assert(dfInt.collect === Array(Row(0), Row(null), Row(2)))
--- End diff --

Use `checkAnswer`?

```Scala
checkAnswer(
  sparkContext.parallelize(Seq[java.lang.Integer](0, null, 2), 1).toDF,
  Seq(Row(0), Row(null), Row(2)))
```





[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r107323552
  
--- Diff: 
core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala ---
@@ -48,12 +51,30 @@ private[spark] object CryptoStreamUtils extends Logging {
   os: OutputStream,
   sparkConf: SparkConf,
   key: Array[Byte]): OutputStream = {
-val properties = toCryptoConf(sparkConf)
-val iv = createInitializationVector(properties)
+val params = new CryptoParams(key, sparkConf)
+val iv = createInitializationVector(params.conf)
 os.write(iv)
-val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION)
-new CryptoOutputStream(transformationStr, properties, os,
-  new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
+new CryptoOutputStream(params.transformation, params.conf, os, params.keySpec,
+  new IvParameterSpec(iv))
+  }
+
+  /**
+   * Wrap a `WritableByteChannel` for encryption.
+   */
+  def createWritableChannel(
+  channel: WritableByteChannel,
+  sparkConf: SparkConf,
+  key: Array[Byte]): WritableByteChannel = {
+val params = new CryptoParams(key, sparkConf)
+val iv = createInitializationVector(params.conf)
+val buf = ByteBuffer.wrap(iv)
+while (buf.hasRemaining()) {
--- End diff --

actually this logic is the same as `CryptoHelperChannel`. Shall we create 
`CryptoHelperChannel` first and simply call `helper.write(buf)` here?
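
A hedged sketch of that suggestion (`FullWriteChannel` is a hypothetical stand-in for the PR's `CryptoHelperChannel`): a `WritableByteChannel` whose `write` loops until the whole buffer is consumed, so the IV write above reduces to a single `helper.write(buf)` call.

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Hypothetical wrapper: delegates to the underlying channel but guarantees
// the entire buffer is written before returning.
private class FullWriteChannel(sink: WritableByteChannel) extends WritableByteChannel {
  override def write(src: ByteBuffer): Int = {
    val toWrite = src.remaining()
    while (src.hasRemaining) {
      sink.write(src)
    }
    toWrite
  }
  override def isOpen: Boolean = sink.isOpen
  override def close(): Unit = sink.close()
}
```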





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75016/
Test FAILed.





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307445
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala
 ---
@@ -519,6 +588,52 @@ class FlatMapGroupsWithStateSuite extends StateStoreMetricsTest with BeforeAndAf
 )
   }
 
+  test("flatMapGroupsWithState - streaming with event time timeout") {
+// Function to maintain the max event time
+// Returns the max event time in the state, or -1 if the state was removed by timeout
+val stateFunc = (
+key: String,
+values: Iterator[(String, Long)],
+state: KeyedState[Long]) => {
+  val timeoutDelay = 5
+  if (key != "a") {
+Iterator.empty
+  } else {
+if (state.hasTimedOut) {
+  state.remove()
+  Iterator((key, -1))
+} else {
+  val valuesSeq = values.toSeq
+  val maxEventTime = math.max(valuesSeq.map(_._2).max, state.getOption.getOrElse(0L))
+  val timeoutTimestampMs = maxEventTime + timeoutDelay
+  state.update(maxEventTime)
+  state.setTimeoutTimestamp(timeoutTimestampMs * 1000)
+  Iterator((key, maxEventTime.toInt))
+}
+  }
+}
+val inputData = MemoryStream[(String, Int)]
+val result =
+  inputData.toDS
+.select($"_1".as("key"), $"_2".cast("timestamp").as("eventTime"))
+.withWatermark("eventTime", "10 seconds")
+.as[(String, Long)]
+.groupByKey[String]((x: (String, Long)) => x._1)
+.flatMapGroupsWithState[Long, (String, Int)](Update, EventTimeTimeout)(stateFunc)
--- End diff --

I was debugging and left them there thinking it helps the readability of the 
tests. I can remove them.





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307617
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala
 ---
@@ -519,6 +588,52 @@ class FlatMapGroupsWithStateSuite extends StateStoreMetricsTest with BeforeAndAf
 )
   }
 
+  test("flatMapGroupsWithState - streaming with event time timeout") {
+// Function to maintain the max event time
+// Returns the max event time in the state, or -1 if the state was removed by timeout
+val stateFunc = (
+key: String,
+values: Iterator[(String, Long)],
+state: KeyedState[Long]) => {
+  val timeoutDelay = 5
+  if (key != "a") {
+Iterator.empty
+  } else {
+if (state.hasTimedOut) {
+  state.remove()
+  Iterator((key, -1))
+} else {
+  val valuesSeq = values.toSeq
+  val maxEventTime = math.max(valuesSeq.map(_._2).max, state.getOption.getOrElse(0L))
+  val timeoutTimestampMs = maxEventTime + timeoutDelay
+  state.update(maxEventTime)
+  state.setTimeoutTimestamp(timeoutTimestampMs * 1000)
+  Iterator((key, maxEventTime.toInt))
+}
+  }
+}
+val inputData = MemoryStream[(String, Int)]
+val result =
+  inputData.toDS
+.select($"_1".as("key"), $"_2".cast("timestamp").as("eventTime"))
+.withWatermark("eventTime", "10 seconds")
+.as[(String, Long)]
+.groupByKey[String]((x: (String, Long)) => x._1)
+.flatMapGroupsWithState[Long, (String, Int)](Update, EventTimeTimeout)(stateFunc)
--- End diff --

As long as they aren't required, it's okay.





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307589
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/streaming/KeyedStateTimeout.java
 ---
@@ -34,9 +32,20 @@
 @InterfaceStability.Evolving
 public class KeyedStateTimeout {
 
-  /** Timeout based on processing time.  */
+  /**
+   * Timeout based on processing time. The duration of timeout can be set 
for each group in
+   * `map/flatMapGroupsWithState` by calling 
`KeyedState.setTimeoutDuration()`.
+   */
   public static KeyedStateTimeout ProcessingTimeTimeout() { return 
ProcessingTimeTimeout$.MODULE$; }
--- End diff --

It's just that if someone does `import KeyedStateTimeout._`, the code boils 
down to `flatMapGroupsWithState(Update, ProcessingTime) { ... }` with no 
reference to timeout.

Fine either way.
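
Concretely, a sketch of the call site in question (`ds` and `stateFunc` are placeholders; `ProcessingTime` is the hypothetical shorter factory name being discussed, not the actual API):

```scala
import org.apache.spark.sql.streaming.KeyedStateTimeout._

// With a wildcard import and a factory named ProcessingTime, nothing at
// the call site mentions "timeout":
ds.groupByKey(_.key)
  .flatMapGroupsWithState(Update, ProcessingTime)(stateFunc)
```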





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/17361
  
LGTM





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107309220
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
 ---
@@ -147,49 +147,68 @@ object UnsupportedOperationChecker {
   throwError("Commands like CreateTable*, AlterTable*, Show* are 
not supported with " +
 "streaming DataFrames/Datasets")
 
-// mapGroupsWithState: Allowed only when no aggregation + Update 
output mode
-case m: FlatMapGroupsWithState if m.isStreaming && 
m.isMapGroupsWithState =>
-  if (collectStreamingAggregates(plan).isEmpty) {
-if (outputMode != InternalOutputModes.Update) {
-  throwError("mapGroupsWithState is not supported with " +
-s"$outputMode output mode on a streaming 
DataFrame/Dataset")
-} else {
-  // Allowed when no aggregation + Update output mode
-}
-  } else {
-throwError("mapGroupsWithState is not supported with 
aggregation " +
-  "on a streaming DataFrame/Dataset")
-  }
-
-// flatMapGroupsWithState without aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && collectStreamingAggregates(plan).isEmpty =>
-  m.outputMode match {
-case InternalOutputModes.Update =>
-  if (outputMode != InternalOutputModes.Update) {
-throwError("flatMapGroupsWithState in update mode is not 
supported with " +
+// mapGroupsWithState and flatMapGroupsWithState
+case m: FlatMapGroupsWithState if m.isStreaming =>
+
+  // Check compatibility with output modes and aggregations in 
query
+  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)
+
+  if (m.isMapGroupsWithState) {   // check mapGroupsWithState
+// allowed only in update query output mode and without 
aggregation
+if (aggsAfterFlatMapGroups.nonEmpty) {
+  throwError(
+"mapGroupsWithState is not supported with aggregation " +
+  "on a streaming DataFrame/Dataset")
+} else if (outputMode != InternalOutputModes.Update) {
+  throwError(
+"mapGroupsWithState is not supported with " +
   s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+}
+  } else {   // check flatMapGroupsWithState
+if (aggsAfterFlatMapGroups.isEmpty) {
+  // flatMapGroupsWithState without aggregation: operation's 
output mode must
+  // match query output mode
+  m.outputMode match {
+case InternalOutputModes.Update if outputMode != 
InternalOutputModes.Update =>
+  throwError(
+"flatMapGroupsWithState in update mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case InternalOutputModes.Append if outputMode != 
InternalOutputModes.Append =>
+  throwError(
+"flatMapGroupsWithState in append mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case _ =>
   }
-case InternalOutputModes.Append =>
-  if (outputMode != InternalOutputModes.Append) {
-throwError("flatMapGroupsWithState in append mode is not 
supported with " +
-  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+} else {
+  // flatMapGroupsWithState with aggregation: update operation 
mode not allowed, and
+  // *groupsWithState after aggregation not allowed
+  if (m.outputMode == InternalOutputModes.Update) {
+throwError(
+  "flatMapGroupsWithState in update mode is not supported 
with " +
+"aggregation on a streaming DataFrame/Dataset")
+  } else if (collectStreamingAggregates(m).nonEmpty) {
+throwError(
+  "flatMapGroupsWithState in append mode is not supported 
after " +
+s"aggregation on a streaming DataFrame/Dataset")
   }
+}
   }
 
-// flatMapGroupsWithState(Update) with aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && m.outputMode == InternalOutputModes.Update
-&& 

[GitHub] spark issue #17256: [SPARK-19919][SQL] Defer throwing the exception for empt...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17256
  
LGTM, merging to master!





[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17316
  
**[Test build #75012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75012/testReport)** for PR 17316 at commit [`7fd17dd`](https://github.com/apache/spark/commit/7fd17dd43441b2c7212f964efd921e8c2d429a9b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17361
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75010/
Test FAILed.





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17361
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #13932: [SPARK-15354] [CORE] Topology aware block replica...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13932#discussion_r107310985
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -53,6 +53,48 @@ trait BlockReplicationPolicy {
   numReplicas: Int): List[BlockManagerId]
 }
 
+object BlockReplicationUtils {
+  // scalastyle:off line.size.limit
+  /**
+   * Uses sampling algorithm by Robert Floyd. Finds a random sample in O(n) while
+   * minimizing space usage. Please see
+   * http://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin
+   * here.
+   *
+   * @param n total number of indices
+   * @param m number of samples needed
+   * @param r random number generator
+   * @return list of m random unique indices
+   */
+  // scalastyle:on line.size.limit
+  private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
+val indices = (n - m + 1 to n).foldLeft(Set.empty[Int]) {case (set, i) =>
+  val t = r.nextInt(i) + 1
+  if (set.contains(t)) set + i else set + t
+}
+// we shuffle the result to ensure a random arrangement within the sample
+// to avoid any bias from set implementations
+r.shuffle(indices.map(_ - 1).toList)
--- End diff --

shall we use `LinkedHashSet` so that we don't need this extra shuffle?
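
A hedged sketch of that suggestion: Floyd's sampler accumulating into a `mutable.LinkedHashSet`, whose iterator follows insertion order, so the trailing shuffle could be dropped; whether that insertion order is unbiased enough is exactly the question raised above.

```scala
import scala.collection.mutable
import scala.util.Random

// Floyd's algorithm: m distinct values from 1..n in O(m) time and space.
// LinkedHashSet preserves insertion order, so the result needs no
// post-hoc shuffle if that order is an acceptable arrangement.
def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
  val sampled = mutable.LinkedHashSet.empty[Int]
  for (i <- n - m + 1 to n) {
    val t = r.nextInt(i) + 1
    sampled += (if (sampled.contains(t)) i else t)
  }
  sampled.iterator.map(_ - 1).toList  // shift to zero-based indices
}
```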





[GitHub] spark pull request #13932: [SPARK-15354] [CORE] Topology aware block replica...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13932#discussion_r107311406
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -88,26 +131,94 @@ class RandomBlockReplicationPolicy
 logDebug(s"Prioritized peers : ${prioritizedPeers.mkString(", ")}")
 prioritizedPeers
   }
+}
+
+@DeveloperApi
+class BasicBlockReplicationPolicy
+  extends BlockReplicationPolicy
+with Logging {
 
-  // scalastyle:off line.size.limit
   /**
-   * Uses sampling algorithm by Robert Floyd. Finds a random sample in O(n) while
-   * minimizing space usage. Please see
-   * http://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin
-   * here.
+   * Method to prioritize a bunch of candidate peers of a block manager. This implementation
+   * replicates the behavior of block replication in HDFS, a peer is chosen within the rack,
--- End diff --

can we explain the replicating logic for any replication factor?





[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...

2017-03-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15628#discussion_r107306774
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -161,6 +162,109 @@ sealed trait Matrix extends Serializable {
*/
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toCSC: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toCSR: SparseMatrix = toSparseMatrix(colMajor = false)
--- End diff --

Same question.





[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-21 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/17348
  
What if the session local timezone is changed?





[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17166
  
**[Test build #75005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75005/testReport)** for PR 17166 at commit [`5707715`](https://github.com/apache/spark/commit/570771555c877fc0b7a8c989e14fdaf4aa79c217).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15009
  
**[Test build #75006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75006/testReport)** for PR 15009 at commit [`fe5b5d6`](https://github.com/apache/spark/commit/fe5b5d64b56dd55ad4a956619bf41c3492975f89).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17361
  
**[Test build #75017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75017/testReport)** for PR 17361 at commit [`9c9668b`](https://github.com/apache/spark/commit/9c9668b9e2d76e3ef56f6a8094c76b5a38178d1b).





[GitHub] spark issue #17381: [SPARK-20023] [SQL] Output table comment for DESC FORMAT...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17381
  
**[Test build #75009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75009/testReport)** for PR 17381 at commit [`6890bbc`](https://github.com/apache/spark/commit/6890bbc36cbefcb886fb772c8199ef64188d4db8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...

2017-03-21 Thread zhaorongsheng
Github user zhaorongsheng commented on the issue:

https://github.com/apache/spark/pull/17350
  
@gatorsmile OK, I will do it and I will give you feedback as soon as 
possible.





[GitHub] spark pull request #13932: [SPARK-15354] [CORE] Topology aware block replica...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13932#discussion_r107316887
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -88,26 +131,94 @@ class RandomBlockReplicationPolicy
 logDebug(s"Prioritized peers : ${prioritizedPeers.mkString(", ")}")
 prioritizedPeers
   }
+}
+
+@DeveloperApi
+class BasicBlockReplicationPolicy
+  extends BlockReplicationPolicy
+with Logging {
 
-  // scalastyle:off line.size.limit
   /**
-   * Uses sampling algorithm by Robert Floyd. Finds a random sample in O(n) while
-   * minimizing space usage. Please see
-   * http://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin
-   * here.
+   * Method to prioritize a bunch of candidate peers of a block manager. This implementation
+   * replicates the behavior of block replication in HDFS, a peer is chosen within the rack,
+   * one outside and that's it. This works best with a total replication factor of 3.
    *
-   * @param n total number of indices
-   * @param m number of samples needed
-   * @param r random number generator
-   * @return list of m random unique indices
+   * @param blockManagerId    Id of the current BlockManager for self identification
+   * @param peers             A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId           BlockId of the block being replicated. This can be used as a source
+   *                          of randomness if needed.
+   * @param numReplicas       Number of peers we need to replicate to
+   * @return A prioritized list of peers. Lower the index of a peer, higher its priority
   */
-  // scalastyle:on line.size.limit
-  private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
-val indices = (n - m + 1 to n).foldLeft(Set.empty[Int]) {case (set, i) =>
-  val t = r.nextInt(i) + 1
-  if (set.contains(t)) set + i else set + t
+  override def prioritize(
+  blockManagerId: BlockManagerId,
+  peers: Seq[BlockManagerId],
+  peersReplicatedTo: mutable.HashSet[BlockManagerId],
+  blockId: BlockId,
+  numReplicas: Int): List[BlockManagerId] = {
+
+logDebug(s"Input peers : $peers")
+logDebug(s"BlockManagerId : $blockManagerId")
+
+val random = new Random(blockId.hashCode)
+
+// if block doesn't have topology info, we can't do much, so we randomly shuffle
+// if there is, we see what's needed from peersReplicatedTo and based on numReplicas,
+// we choose what's needed
+if (blockManagerId.topologyInfo.isEmpty || numReplicas == 0) {
+  // no topology info for the block. The best we can do is randomly choose peers
+  BlockReplicationUtils.getRandomSample(peers, numReplicas, random)
+} else {
+  // we have topology information, we see what is left to be done from peersReplicatedTo
+  val doneWithinRack = peersReplicatedTo.exists(_.topologyInfo == blockManagerId.topologyInfo)
+  val doneOutsideRack = peersReplicatedTo.exists { p =>
+p.topologyInfo.isDefined && p.topologyInfo != blockManagerId.topologyInfo
+  }
+
+  if (doneOutsideRack && doneWithinRack) {
--- End diff --

what? I think this branch is where we should do smart replication





[GitHub] spark pull request #13932: [SPARK-15354] [CORE] Topology aware block replica...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13932#discussion_r107316905
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -88,26 +131,94 @@ class RandomBlockReplicationPolicy
 logDebug(s"Prioritized peers : ${prioritizedPeers.mkString(", ")}")
 prioritizedPeers
   }
+}
+
+@DeveloperApi
+class BasicBlockReplicationPolicy
+  extends BlockReplicationPolicy
+with Logging {
 
-  // scalastyle:off line.size.limit
   /**
-   * Uses sampling algorithm by Robert Floyd. Finds a random sample in O(n) while
-   * minimizing space usage. Please see
-   * http://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin
-   * here.
+   * Method to prioritize a bunch of candidate peers of a block manager. This implementation
+   * replicates the behavior of block replication in HDFS, a peer is chosen within the rack,
+   * one outside and that's it. This works best with a total replication factor of 3.
    *
-   * @param n total number of indices
-   * @param m number of samples needed
-   * @param r random number generator
-   * @return list of m random unique indices
+   * @param blockManagerId    Id of the current BlockManager for self identification
+   * @param peers             A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId           BlockId of the block being replicated. This can be used as a source
+   *                          of randomness if needed.
+   * @param numReplicas       Number of peers we need to replicate to
+   * @return A prioritized list of peers. Lower the index of a peer, higher its priority
   */
-  // scalastyle:on line.size.limit
-  private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
-val indices = (n - m + 1 to n).foldLeft(Set.empty[Int]) {case (set, i) =>
-  val t = r.nextInt(i) + 1
-  if (set.contains(t)) set + i else set + t
+  override def prioritize(
+  blockManagerId: BlockManagerId,
+  peers: Seq[BlockManagerId],
+  peersReplicatedTo: mutable.HashSet[BlockManagerId],
+  blockId: BlockId,
+  numReplicas: Int): List[BlockManagerId] = {
+
+logDebug(s"Input peers : $peers")
+logDebug(s"BlockManagerId : $blockManagerId")
+
+val random = new Random(blockId.hashCode)
+
+// if block doesn't have topology info, we can't do much, so we randomly shuffle
+// if there is, we see what's needed from peersReplicatedTo and based on numReplicas,
+// we choose what's needed
+if (blockManagerId.topologyInfo.isEmpty || numReplicas == 0) {
+  // no topology info for the block. The best we can do is randomly choose peers
+  BlockReplicationUtils.getRandomSample(peers, numReplicas, random)
+} else {
+  // we have topology information, we see what is left to be done from peersReplicatedTo
+  val doneWithinRack = peersReplicatedTo.exists(_.topologyInfo == blockManagerId.topologyInfo)
+  val doneOutsideRack = peersReplicatedTo.exists { p =>
+p.topologyInfo.isDefined && p.topologyInfo != blockManagerId.topologyInfo
+  }
+
+  if (doneOutsideRack && doneWithinRack) {
+// we are done, we just return a random sample
--- End diff --

what? I think this branch is where we should do smart replication





[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17348
  
**[Test build #75018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75018/testReport)** for PR 17348 at commit [`93f05f3`](https://github.com/apache/spark/commit/93f05f3545d9af335ca1f6c711b6f84b9938b95e).





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17361
  
**[Test build #3604 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3604/testReport)**
 for PR 17361 at commit 
[`9c9668b`](https://github.com/apache/spark/commit/9c9668b9e2d76e3ef56f6a8094c76b5a38178d1b).





[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318916
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -83,3 +84,146 @@ private[kinesis] final case class 
STSCredentialsProvider(
 }
   }
 }
+
+@InterfaceStability.Stable
+object SerializableCredentialsProvider {
--- End diff --

I agree we should definitely come up with a better name here. What about 
```SparkAWSCredentials```? Obviously it's not as succinct as 
```AWSCredentials``` but I think it's a clear name that avoids collisions.

I'm okay with ```CredentialsProvider``` otherwise.





[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307253
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
 ---
@@ -147,49 +147,68 @@ object UnsupportedOperationChecker {
   throwError("Commands like CreateTable*, AlterTable*, Show* are 
not supported with " +
 "streaming DataFrames/Datasets")
 
-// mapGroupsWithState: Allowed only when no aggregation + Update 
output mode
-case m: FlatMapGroupsWithState if m.isStreaming && 
m.isMapGroupsWithState =>
-  if (collectStreamingAggregates(plan).isEmpty) {
-if (outputMode != InternalOutputModes.Update) {
-  throwError("mapGroupsWithState is not supported with " +
-s"$outputMode output mode on a streaming 
DataFrame/Dataset")
-} else {
-  // Allowed when no aggregation + Update output mode
-}
-  } else {
-throwError("mapGroupsWithState is not supported with 
aggregation " +
-  "on a streaming DataFrame/Dataset")
-  }
-
-// flatMapGroupsWithState without aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && collectStreamingAggregates(plan).isEmpty =>
-  m.outputMode match {
-case InternalOutputModes.Update =>
-  if (outputMode != InternalOutputModes.Update) {
-throwError("flatMapGroupsWithState in update mode is not 
supported with " +
+// mapGroupsWithState and flatMapGroupsWithState
+case m: FlatMapGroupsWithState if m.isStreaming =>
+
+  // Check compatibility with output modes and aggregations in 
query
+  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)
+
+  if (m.isMapGroupsWithState) {   // check mapGroupsWithState
+// allowed only in update query output mode and without 
aggregation
+if (aggsAfterFlatMapGroups.nonEmpty) {
+  throwError(
+"mapGroupsWithState is not supported with aggregation " +
+  "on a streaming DataFrame/Dataset")
+} else if (outputMode != InternalOutputModes.Update) {
+  throwError(
+"mapGroupsWithState is not supported with " +
   s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+}
+  } else {   // check flatMapGroupsWithState
+if (aggsAfterFlatMapGroups.isEmpty) {
+  // flatMapGroupsWithState without aggregation: operation's 
output mode must
+  // match query output mode
+  m.outputMode match {
+case InternalOutputModes.Update if outputMode != 
InternalOutputModes.Update =>
+  throwError(
+"flatMapGroupsWithState in update mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case InternalOutputModes.Append if outputMode != 
InternalOutputModes.Append =>
+  throwError(
+"flatMapGroupsWithState in append mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case _ =>
   }
-case InternalOutputModes.Append =>
-  if (outputMode != InternalOutputModes.Append) {
-throwError("flatMapGroupsWithState in append mode is not 
supported with " +
-  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+} else {
+  // flatMapGroupsWithState with aggregation: update operation 
mode not allowed, and
+  // *groupsWithState after aggregation not allowed
+  if (m.outputMode == InternalOutputModes.Update) {
+throwError(
+  "flatMapGroupsWithState in update mode is not supported 
with " +
+"aggregation on a streaming DataFrame/Dataset")
+  } else if (collectStreamingAggregates(m).nonEmpty) {
+throwError(
+  "flatMapGroupsWithState in append mode is not supported 
after " +
+s"aggregation on a streaming DataFrame/Dataset")
   }
+}
   }
 
-// flatMapGroupsWithState(Update) with aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && m.outputMode == InternalOutputModes.Update
-&& 

[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307367
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/KeyedStateImpl.scala
 ---
@@ -17,37 +17,45 @@
 
 package org.apache.spark.sql.execution.streaming
 
+import java.sql.Date
+
 import org.apache.commons.lang3.StringUtils
 
-import org.apache.spark.sql.streaming.KeyedState
+import org.apache.spark.sql.catalyst.plans.logical.{EventTimeTimeout, 
ProcessingTimeTimeout}
+import org.apache.spark.sql.execution.streaming.KeyedStateImpl._
+import org.apache.spark.sql.streaming.{KeyedState, KeyedStateTimeout}
 import org.apache.spark.unsafe.types.CalendarInterval
 
+
 /**
  * Internal implementation of the [[KeyedState]] interface. Methods are 
not thread-safe.
  * @param optionalValue Optional value of the state
  * @param batchProcessingTimeMs Processing time of current batch, used to 
calculate timestamp
  *  for processing time timeouts
- * @param isTimeoutEnabled Whether timeout is enabled. This will be used 
to check whether the user
- * is allowed to configure timeouts.
+ * @param timeoutConf   Type of timeout configured. Based on this, 
different operations will
+ *be supported.
--- End diff --

nit: indent is inconsistent





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307283
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
 ---
@@ -147,49 +147,68 @@ object UnsupportedOperationChecker {
   throwError("Commands like CreateTable*, AlterTable*, Show* are 
not supported with " +
 "streaming DataFrames/Datasets")
 
-// mapGroupsWithState: Allowed only when no aggregation + Update 
output mode
-case m: FlatMapGroupsWithState if m.isStreaming && 
m.isMapGroupsWithState =>
-  if (collectStreamingAggregates(plan).isEmpty) {
-if (outputMode != InternalOutputModes.Update) {
-  throwError("mapGroupsWithState is not supported with " +
-s"$outputMode output mode on a streaming 
DataFrame/Dataset")
-} else {
-  // Allowed when no aggregation + Update output mode
-}
-  } else {
-throwError("mapGroupsWithState is not supported with 
aggregation " +
-  "on a streaming DataFrame/Dataset")
-  }
-
-// flatMapGroupsWithState without aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && collectStreamingAggregates(plan).isEmpty =>
-  m.outputMode match {
-case InternalOutputModes.Update =>
-  if (outputMode != InternalOutputModes.Update) {
-throwError("flatMapGroupsWithState in update mode is not 
supported with " +
+// mapGroupsWithState and flatMapGroupsWithState
+case m: FlatMapGroupsWithState if m.isStreaming =>
+
+  // Check compatibility with output modes and aggregations in 
query
+  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)
+
+  if (m.isMapGroupsWithState) {   // check mapGroupsWithState
+// allowed only in update query output mode and without 
aggregation
+if (aggsAfterFlatMapGroups.nonEmpty) {
+  throwError(
+"mapGroupsWithState is not supported with aggregation " +
+  "on a streaming DataFrame/Dataset")
+} else if (outputMode != InternalOutputModes.Update) {
+  throwError(
+"mapGroupsWithState is not supported with " +
   s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+}
+  } else {   // check flatMapGroupsWithState
+if (aggsAfterFlatMapGroups.isEmpty) {
+  // flatMapGroupsWithState without aggregation: operation's 
output mode must
+  // match query output mode
+  m.outputMode match {
+case InternalOutputModes.Update if outputMode != 
InternalOutputModes.Update =>
+  throwError(
+"flatMapGroupsWithState in update mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case InternalOutputModes.Append if outputMode != 
InternalOutputModes.Append =>
+  throwError(
+"flatMapGroupsWithState in append mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case _ =>
   }
-case InternalOutputModes.Append =>
-  if (outputMode != InternalOutputModes.Append) {
-throwError("flatMapGroupsWithState in append mode is not 
supported with " +
-  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+} else {
+  // flatMapGroupsWithState with aggregation: update operation 
mode not allowed, and
+  // *groupsWithState after aggregation not allowed
+  if (m.outputMode == InternalOutputModes.Update) {
+throwError(
+  "flatMapGroupsWithState in update mode is not supported 
with " +
+"aggregation on a streaming DataFrame/Dataset")
+  } else if (collectStreamingAggregates(m).nonEmpty) {
+throwError(
+  "flatMapGroupsWithState in append mode is not supported 
after " +
+s"aggregation on a streaming DataFrame/Dataset")
   }
+}
   }
 
-// flatMapGroupsWithState(Update) with aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && m.outputMode == InternalOutputModes.Update
-&& 

[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307722
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala
 ---
@@ -147,49 +147,68 @@ object UnsupportedOperationChecker {
   throwError("Commands like CreateTable*, AlterTable*, Show* are 
not supported with " +
 "streaming DataFrames/Datasets")
 
-// mapGroupsWithState: Allowed only when no aggregation + Update 
output mode
-case m: FlatMapGroupsWithState if m.isStreaming && 
m.isMapGroupsWithState =>
-  if (collectStreamingAggregates(plan).isEmpty) {
-if (outputMode != InternalOutputModes.Update) {
-  throwError("mapGroupsWithState is not supported with " +
-s"$outputMode output mode on a streaming 
DataFrame/Dataset")
-} else {
-  // Allowed when no aggregation + Update output mode
-}
-  } else {
-throwError("mapGroupsWithState is not supported with 
aggregation " +
-  "on a streaming DataFrame/Dataset")
-  }
-
-// flatMapGroupsWithState without aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && collectStreamingAggregates(plan).isEmpty =>
-  m.outputMode match {
-case InternalOutputModes.Update =>
-  if (outputMode != InternalOutputModes.Update) {
-throwError("flatMapGroupsWithState in update mode is not 
supported with " +
+// mapGroupsWithState and flatMapGroupsWithState
+case m: FlatMapGroupsWithState if m.isStreaming =>
+
+  // Check compatibility with output modes and aggregations in 
query
+  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)
+
+  if (m.isMapGroupsWithState) {   // check mapGroupsWithState
+// allowed only in update query output mode and without 
aggregation
+if (aggsAfterFlatMapGroups.nonEmpty) {
+  throwError(
+"mapGroupsWithState is not supported with aggregation " +
+  "on a streaming DataFrame/Dataset")
+} else if (outputMode != InternalOutputModes.Update) {
+  throwError(
+"mapGroupsWithState is not supported with " +
   s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+}
+  } else {   // check flatMapGroupsWithState
+if (aggsAfterFlatMapGroups.isEmpty) {
+  // flatMapGroupsWithState without aggregation: operation's 
output mode must
+  // match query output mode
+  m.outputMode match {
+case InternalOutputModes.Update if outputMode != 
InternalOutputModes.Update =>
+  throwError(
+"flatMapGroupsWithState in update mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case InternalOutputModes.Append if outputMode != 
InternalOutputModes.Append =>
+  throwError(
+"flatMapGroupsWithState in append mode is not 
supported with " +
+  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+
+case _ =>
   }
-case InternalOutputModes.Append =>
-  if (outputMode != InternalOutputModes.Append) {
-throwError("flatMapGroupsWithState in append mode is not 
supported with " +
-  s"$outputMode output mode on a streaming 
DataFrame/Dataset")
+} else {
+  // flatMapGroupsWithState with aggregation: update operation 
mode not allowed, and
+  // *groupsWithState after aggregation not allowed
+  if (m.outputMode == InternalOutputModes.Update) {
+throwError(
+  "flatMapGroupsWithState in update mode is not supported 
with " +
+"aggregation on a streaming DataFrame/Dataset")
+  } else if (collectStreamingAggregates(m).nonEmpty) {
+throwError(
+  "flatMapGroupsWithState in append mode is not supported 
after " +
+s"aggregation on a streaming DataFrame/Dataset")
   }
+}
   }
 
-// flatMapGroupsWithState(Update) with aggregation
-case m: FlatMapGroupsWithState
-  if m.isStreaming && m.outputMode == InternalOutputModes.Update
-&& 

[GitHub] spark issue #17256: [SPARK-19919][SQL] Defer throwing the exception for empt...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17256
  
cc @cloud-fan, could you see if it sounds good?





[GitHub] spark pull request #17361: [SPARK-20030][SS] Event-time-based timeout for Ma...

2017-03-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17361#discussion_r107307806
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/streaming/KeyedStateTimeout.java
 ---
@@ -34,9 +32,20 @@
 @InterfaceStability.Evolving
 public class KeyedStateTimeout {
 
-  /** Timeout based on processing time.  */
+  /**
+   * Timeout based on processing time. The duration of timeout can be set 
for each group in
+   * `map/flatMapGroupsWithState` by calling 
`KeyedState.setTimeoutDuration()`.
+   */
  public static KeyedStateTimeout ProcessingTimeTimeout() { return ProcessingTimeTimeout$.MODULE$; }
--- End diff --

I'd probably still remove it.
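To make the quoted javadoc concrete, here is a hedged sketch of setting a
per-group timeout from inside a `map/flatMapGroupsWithState` function. The API
was still in flux in this PR, so treat this as an illustration; the state type
and counting logic are made up.

```scala
import org.apache.spark.sql.streaming.KeyedState

// Running count per key; every invocation refreshes the group's
// processing-time timeout via KeyedState.setTimeoutDuration(), as the
// ProcessingTimeTimeout() javadoc above describes.
def countWithTimeout(key: String, values: Iterator[Int], state: KeyedState[Long]): Long = {
  val newCount = state.getOption.getOrElse(0L) + values.size
  state.update(newCount)
  state.setTimeoutDuration("10 minutes") // per-group timeout, reset on each trigger
  newCount
}
```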





[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17361
  
**[Test build #75010 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75010/testReport)**
 for PR 17361 at commit 
[`64b6abf`](https://github.com/apache/spark/commit/64b6abf89189fe8a3f3d2f58f341a9b58a95a6d2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17256: [SPARK-19919][SQL] Defer throwing the exception f...

2017-03-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17256





[GitHub] spark issue #17256: [SPARK-19919][SQL] Defer throwing the exception for empt...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17256
  
Thank you @cloud-fan.





[GitHub] spark pull request #17348: [SPARK-20018][SQL] Pivot with timestamp and count...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17348#discussion_r107311172
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -486,14 +486,16 @@ class Analyzer(
   case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, 
child) =>
 val singleAgg = aggregates.size == 1
 def outputName(value: Literal, aggregate: Expression): String = {
+  val scalaValue = CatalystTypeConverters.convertToScala(value.value, value.dataType)
+  val stringValue = Option(scalaValue).getOrElse("null").toString
--- End diff --

The impact is not only on the data type `timestamp`. Any test case to cover 
`null`? 
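For what it's worth, a sketch of the kind of test being asked about
(illustrative only; assumes an existing `SparkSession` named `spark`):

```scala
import spark.implicits._

// Pivot on a column that contains null; the fix should produce a "null"
// column name rather than fail on the null pivot value.
val df = Seq((Some("a"), 1), (Option.empty[String], 2)).toDF("k", "v")
val pivoted = df.groupBy().pivot("k", Seq("a", null)).sum("v")
pivoted.columns.foreach(println) // expect columns named "a" and "null"
```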





[GitHub] spark issue #17375: [SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collecti...

2017-03-21 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/17375
  
lgtm





[GitHub] spark issue #17374: [SPARK-19019][PYTHON][BRANCH-2.0] Fix hijacked `collecti...

2017-03-21 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/17374
  
lgtm





[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75004/
Test PASSed.





[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11119
  
**[Test build #75004 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75004/consoleFull)**
 for PR 11119 at commit 
[`6f169eb`](https://github.com/apache/spark/commit/6f169ebf8c0c832010d2dbd8f971cfabff7870f2).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11119
  
Build finished. Test PASSed.





[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17166
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75005/
Test PASSed.





[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17166
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15009
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75006/
Test PASSed.





[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

2017-03-21 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/16781
  
I checked HIVE-12767 and reviewed this pr roughly.
And I collected my thoughts about the desired behavior of this issue, 
please correct me if I'm wrong.

when creating table:

- if `SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE` == true
  - if a property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
- include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
- use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value
  - else
- include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
- use session local timezone as the default 
`PARQUET_TIMEZONE_TABLE_PROPERTY` value
- else
  - if a property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
- include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
- use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value
  - else
- don't include table property `PARQUET_TIMEZONE_TABLE_PROPERTY`

when writing/reading data:

- if a table property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
  - use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value to adjust timezone
- else
  - don't adjust timezone

Timezone related expressions respect session local timezone now, so we 
should also use session local timezone as the default value of 
`PARQUET_TIMEZONE_TABLE_PROPERTY` instead of system timezone, i.e. use 
`sparkSession.sessionState.conf.sessionLocalTimeZone` instead of 
`TimeZone.getDefault()`.
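If it helps, the create-table branch above collapses to a few lines. This is a
sketch only: `includeTimezone` and `existingProp` stand in for the
`SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE` flag and the current
`PARQUET_TIMEZONE_TABLE_PROPERTY` value.

```scala
// Returns the PARQUET_TIMEZONE_TABLE_PROPERTY value to store,
// or None to omit the property entirely.
def timezonePropOnCreate(
    includeTimezone: Boolean,
    existingProp: Option[String],
    sessionLocalTimeZone: String): Option[String] =
  (includeTimezone, existingProp) match {
    case (_, Some(tz)) => Some(tz)                   // keep an explicitly set value
    case (true, None)  => Some(sessionLocalTimeZone) // default to session local timezone
    case (false, None) => None                       // don't include the property
  }
```

Reading and writing then adjust the timezone if and only if the stored table
property is present.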






[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107314977
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -41,7 +41,20 @@ object CatalystSerde {
   }
 
   def generateObjAttr[T : Encoder]: Attribute = {
-AttributeReference("obj", encoderFor[T].deserializer.dataType, 
nullable = false)()
+val deserializer = encoderFor[T].deserializer
+val dataType = deserializer.dataType
+val nullable = if (deserializer.childrenResolved) {
--- End diff --

The point is whether `deserializer` is resolved or not, not whether
`encoderFor[T]` is resolved.
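A tiny illustration of that distinction: for a leaf expression,
`childrenResolved` is vacuously true while `resolved` can still be false.

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

val attr = UnresolvedAttribute("a")
println(attr.childrenResolved) // true  -- there are no children to resolve
println(attr.resolved)         // false -- the attribute itself is unresolved
```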





[GitHub] spark issue #17381: [SPARK-20023] [SQL] Output table comment for DESC FORMAT...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17381
  
cc @cloud-fan @wzhfy 






[GitHub] spark pull request #13932: [SPARK-15354] [CORE] Topology aware block replica...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13932#discussion_r107316810
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockReplicationPolicy.scala ---
@@ -88,26 +131,94 @@ class RandomBlockReplicationPolicy
 logDebug(s"Prioritized peers : ${prioritizedPeers.mkString(", ")}")
 prioritizedPeers
   }
+}
+
+@DeveloperApi
+class BasicBlockReplicationPolicy
+  extends BlockReplicationPolicy
+with Logging {
 
-  // scalastyle:off line.size.limit
   /**
-   * Uses sampling algorithm by Robert Floyd. Finds a random sample in 
O(n) while
-   * minimizing space usage. Please see http://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin;>
-   * here.
+   * Method to prioritize a bunch of candidate peers of a block manager. 
This implementation
+   * replicates the behavior of block replication in HDFS, a peer is 
chosen within the rack,
+   * one outside and that's it. This works best with a total replication 
factor of 3.
*
-   * @param n total number of indices
-   * @param m number of samples needed
-   * @param r random number generator
-   * @return list of m random unique indices
+   * @param blockManagerIdId of the current BlockManager for self 
identification
+   * @param peers A list of peers of a BlockManager
+   * @param peersReplicatedTo Set of peers already replicated to
+   * @param blockId   BlockId of the block being replicated. This 
can be used as a source of
+   *  randomness if needed.
+   * @param numReplicas Number of peers we need to replicate to
+   * @return A prioritized list of peers. Lower the index of a peer, 
higher its priority
*/
-  // scalastyle:on line.size.limit
-  private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
-val indices = (n - m + 1 to n).foldLeft(Set.empty[Int]) {case (set, i) 
=>
-  val t = r.nextInt(i) + 1
-  if (set.contains(t)) set + i else set + t
+  override def prioritize(
+  blockManagerId: BlockManagerId,
+  peers: Seq[BlockManagerId],
+  peersReplicatedTo: mutable.HashSet[BlockManagerId],
+  blockId: BlockId,
+  numReplicas: Int): List[BlockManagerId] = {
+
+logDebug(s"Input peers : $peers")
+logDebug(s"BlockManagerId : $blockManagerId")
+
+val random = new Random(blockId.hashCode)
+
+// if block doesn't have topology info, we can't do much, so we 
randlomly shuffle
+// if there is, we see what's needed from peersReplicatedTo and based 
on numReplicas,
+// we choose whats needed
+if (blockManagerId.topologyInfo.isEmpty || numReplicas == 0) {
+  // no topology info for the block. The best we can do is randomly 
choose peers
+  BlockReplicationUtils.getRandomSample(peers, numReplicas, random)
+} else {
+  // we have topology information, we see what is left to be done from 
peersReplicatedTo
+  val doneWithinRack = peersReplicatedTo.exists(_.topologyInfo == 
blockManagerId.topologyInfo)
+  val doneOutsideRack = peersReplicatedTo.exists { p =>
--- End diff --

calculate the `inRackPeers` and `outOfRackPeers` here to reduce duplicated 
code
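A self-contained sketch of that suggestion, with `Option[String]` standing in
for the topology info: partition the candidates once, then reuse both pools in
the branches below.

```scala
case class BM(id: String, topologyInfo: Option[String])

def splitByRack(self: BM, peers: Seq[BM]): (Seq[BM], Seq[BM]) =
  peers.filter(_.topologyInfo.isDefined)          // drop peers with no topology info
    .partition(_.topologyInfo == self.topologyInfo)

val me = BM("bm-0", Some("rack-1"))
val candidates = Seq(BM("bm-1", Some("rack-1")), BM("bm-2", Some("rack-2")), BM("bm-3", None))
val (inRackPeers, outOfRackPeers) = splitByRack(me, candidates)
// inRackPeers    -> Seq(BM("bm-1", Some("rack-1")))
// outOfRackPeers -> Seq(BM("bm-2", Some("rack-2")))
```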





[GitHub] spark issue #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - recover f...

2017-03-21 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17382
  
Scala tests have passed. I am merging this to unblock other PRs.





[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107317074
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -41,7 +41,20 @@ object CatalystSerde {
   }
 
   def generateObjAttr[T : Encoder]: Attribute = {
-AttributeReference("obj", encoderFor[T].deserializer.dataType, 
nullable = false)()
+val deserializer = encoderFor[T].deserializer
+val dataType = deserializer.dataType
+val nullable = if (deserializer.childrenResolved) {
--- End diff --

In the above context, I confirmed that `encoderFor[T]` is not resolved and 
`encoderFor[T].deserializer` is resolved.
Do you want to see my log output?





[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318150
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala
 ---
@@ -71,7 +75,256 @@ private[kinesis] class KinesisInputDStream[T: ClassTag](
 
   override def getReceiver(): Receiver[T] = {
 new KinesisReceiver(streamName, endpointUrl, regionName, 
initialPositionInStream,
-  checkpointAppName, checkpointInterval, storageLevel, messageHandler,
-  kinesisCredsProvider)
+  checkpointAppName, checkpointInterval, _storageLevel, messageHandler,
+  kinesisCredsProvider, dynamoDBCredsProvider, cloudWatchCredsProvider)
   }
 }
+
+@InterfaceStability.Stable
+object KinesisInputDStream {
+  /**
+   * Builder for [[KinesisInputDStream]] instances.
+   *
+   * @since 2.2.0
+   */
+  @InterfaceStability.Stable
+  class Builder {
+// Required params
+private var streamingContext: Option[StreamingContext] = None
+private var streamName: Option[String] = None
+private var checkpointAppName: Option[String] = None
+
+// Params with defaults
+private var endpointUrl: Option[String] = None
+private var regionName: Option[String] = None
+private var initialPositionInStream: Option[InitialPositionInStream] = 
None
+private var checkpointInterval: Option[Duration] = None
+private var storageLevel: Option[StorageLevel] = None
+private var kinesisCredsProvider: 
Option[SerializableCredentialsProvider] = None
+private var dynamoDBCredsProvider: 
Option[SerializableCredentialsProvider] = None
+private var cloudWatchCredsProvider: 
Option[SerializableCredentialsProvider] = None
+
+/**
+ * Sets the StreamingContext that will be used to construct the 
Kinesis DStream. This is a
+ * required parameter.
+ *
+ * @param ssc [[StreamingContext]] used to construct Kinesis DStreams
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamingContext(ssc: StreamingContext): Builder = {
+  streamingContext = Option(ssc)
+  this
+}
+
+/**
+ * Sets the StreamingContext that will be used to construct the 
Kinesis DStream. This is a
+ * required parameter.
+ *
+ * @param jssc [[JavaStreamingContext]] used to construct Kinesis 
DStreams
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamingContext(jssc: JavaStreamingContext): Builder = {
+  streamingContext = Option(jssc.ssc)
+  this
+}
+
+/**
+ * Sets the name of the Kinesis stream that the DStream will read 
from. This is a required
+ * parameter.
+ *
+ * @param streamName Name of Kinesis stream that the DStream will read 
from
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamName(streamName: String): Builder = {
+  this.streamName = Option(streamName)
+  this
+}
+
+/**
+ * Sets the KCL application name to use when checkpointing state to 
DynamoDB. This is a
+ * required parameter.
+ *
+ * @param appName Value to use for the KCL app name (used when 
creating the DynamoDB checkpoint
+ *table and when writing metrics to CloudWatch)
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def checkpointAppName(appName: String): Builder = {
+  checkpointAppName = Option(appName)
+  this
+}
+
+/**
+ * Sets the AWS Kinesis endpoint URL. Defaults to "https://kinesis.us-east-1.amazonaws.com" if
+ * no custom value is specified
+ *
+ * @param url Kinesis endpoint URL to use
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def endpointUrl(url: String): Builder = {
+  endpointUrl = Option(url)
+  this
+}
+
+/**
+ * Sets the AWS region to construct clients for. Defaults to 
"us-east-1" if no custom value
+ * is specified.
+ *
+ * @param regionName Name of AWS region to use (e.g. "us-west-2")
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def regionName(regionName: String): Builder = {
+  this.regionName = Option(regionName)
+  this
+}
+
+/**
+ * Sets the initial position data is read from in the Kinesis stream. 
Defaults to
+ * [[InitialPositionInStream.LATEST]] if no custom value is specified.
+ *
+ * @param initialPosition InitialPositionInStream value specifying 
where Spark Streaming
  

[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318178
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala
 ---
@@ -71,7 +75,256 @@ private[kinesis] class KinesisInputDStream[T: ClassTag](
 
   override def getReceiver(): Receiver[T] = {
 new KinesisReceiver(streamName, endpointUrl, regionName, 
initialPositionInStream,
-  checkpointAppName, checkpointInterval, storageLevel, messageHandler,
-  kinesisCredsProvider)
+  checkpointAppName, checkpointInterval, _storageLevel, messageHandler,
+  kinesisCredsProvider, dynamoDBCredsProvider, cloudWatchCredsProvider)
   }
 }
+
+@InterfaceStability.Stable
--- End diff --

Will do





[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318157
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala
 ---
@@ -71,7 +75,256 @@ private[kinesis] class KinesisInputDStream[T: ClassTag](
 
   override def getReceiver(): Receiver[T] = {
 new KinesisReceiver(streamName, endpointUrl, regionName, 
initialPositionInStream,
-  checkpointAppName, checkpointInterval, storageLevel, messageHandler,
-  kinesisCredsProvider)
+  checkpointAppName, checkpointInterval, _storageLevel, messageHandler,
+  kinesisCredsProvider, dynamoDBCredsProvider, cloudWatchCredsProvider)
   }
 }
+
+@InterfaceStability.Stable
+object KinesisInputDStream {
+  /**
+   * Builder for [[KinesisInputDStream]] instances.
+   *
+   * @since 2.2.0
+   */
+  @InterfaceStability.Stable
+  class Builder {
+// Required params
+private var streamingContext: Option[StreamingContext] = None
+private var streamName: Option[String] = None
+private var checkpointAppName: Option[String] = None
+
+// Params with defaults
+private var endpointUrl: Option[String] = None
+private var regionName: Option[String] = None
+private var initialPositionInStream: Option[InitialPositionInStream] = 
None
+private var checkpointInterval: Option[Duration] = None
+private var storageLevel: Option[StorageLevel] = None
+private var kinesisCredsProvider: 
Option[SerializableCredentialsProvider] = None
+private var dynamoDBCredsProvider: 
Option[SerializableCredentialsProvider] = None
+private var cloudWatchCredsProvider: 
Option[SerializableCredentialsProvider] = None
+
+/**
+ * Sets the StreamingContext that will be used to construct the 
Kinesis DStream. This is a
+ * required parameter.
+ *
+ * @param ssc [[StreamingContext]] used to construct Kinesis DStreams
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamingContext(ssc: StreamingContext): Builder = {
+  streamingContext = Option(ssc)
+  this
+}
+
+/**
+ * Sets the StreamingContext that will be used to construct the 
Kinesis DStream. This is a
+ * required parameter.
+ *
+ * @param jssc [[JavaStreamingContext]] used to construct Kinesis 
DStreams
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamingContext(jssc: JavaStreamingContext): Builder = {
+  streamingContext = Option(jssc.ssc)
+  this
+}
+
+/**
+ * Sets the name of the Kinesis stream that the DStream will read 
from. This is a required
+ * parameter.
+ *
+ * @param streamName Name of Kinesis stream that the DStream will read 
from
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def streamName(streamName: String): Builder = {
+  this.streamName = Option(streamName)
+  this
+}
+
+/**
+ * Sets the KCL application name to use when checkpointing state to 
DynamoDB. This is a
+ * required parameter.
+ *
+ * @param appName Value to use for the KCL app name (used when 
creating the DynamoDB checkpoint
+ *table and when writing metrics to CloudWatch)
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def checkpointAppName(appName: String): Builder = {
+  checkpointAppName = Option(appName)
+  this
+}
+
+/**
+ * Sets the AWS Kinesis endpoint URL. Defaults to "https://kinesis.us-east-1.amazonaws.com" if
+ * no custom value is specified
+ *
+ * @param url Kinesis endpoint URL to use
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def endpointUrl(url: String): Builder = {
+  endpointUrl = Option(url)
+  this
+}
+
+/**
+ * Sets the AWS region to construct clients for. Defaults to 
"us-east-1" if no custom value
+ * is specified.
+ *
+ * @param regionName Name of AWS region to use (e.g. "us-west-2")
+ * @return Reference to this [[KinesisInputDStream.Builder]]
+ */
+def regionName(regionName: String): Builder = {
+  this.regionName = Option(regionName)
+  this
+}
+
+/**
+ * Sets the initial position data is read from in the Kinesis stream. 
Defaults to
+ * [[InitialPositionInStream.LATEST]] if no custom value is specified.
+ *
+ * @param initialPosition InitialPositionInStream value specifying 
where Spark Streaming
  

[GitHub] spark pull request #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - re...

2017-03-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17382





[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318164
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -83,3 +84,146 @@ private[kinesis] final case class 
STSCredentialsProvider(
 }
   }
 }
+
+@InterfaceStability.Stable
--- End diff --

For sure. Honestly I just cribbed these annotations from 
```SparkSession.Builder``` so I appreciate you letting me know what the proper 
convention is.





[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107318994
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -83,3 +84,146 @@ private[kinesis] final case class 
STSCredentialsProvider(
 }
   }
 }
+
+@InterfaceStability.Stable
+object SerializableCredentialsProvider {
+  /**
+   * Builder for [[SerializableCredentialsProvider]] instances.
+   *
+   * @since 2.2.0
+   */
+  @InterfaceStability.Stable
+  class Builder {
+private var awsAccessKeyId: Option[String] = None
+private var awsSecretKey: Option[String] = None
+private var stsRoleArn: Option[String] = None
+private var stsSessionName: Option[String] = None
+private var stsExternalId: Option[String] = None
+
+
+/**
+ * Sets the AWS access key ID when using a basic AWS keypair for 
long-lived authorization
+ * credentials. A value must also be provided for the AWS secret key.
+ *
+ * @param accessKeyId AWS access key ID
+ * @return Reference to this 
[[SerializableCredentialsProvider.Builder]]
+ */
+def awsAccessKeyId(accessKeyId: String): Builder = {
--- End diff --

I'll rework this builder to take multiple arguments for the long-lived 
keypair and STS
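Purely illustrative sketch of such a rework (hypothetical names, not the PR's
code): group each credential style into one call instead of one setter per
field, so the keypair and the STS parameters can be validated together.

```scala
class CredsBuilder {
  private var keypair: Option[(String, String)] = None // (accessKeyId, secretKey)
  private var sts: Option[(String, String)] = None     // (roleArn, sessionName)

  def basicCredentials(accessKeyId: String, secretKey: String): CredsBuilder = {
    keypair = Some((accessKeyId, secretKey))
    this
  }

  def stsCredentials(roleArn: String, sessionName: String): CredsBuilder = {
    sts = Some((roleArn, sessionName))
    this
  }
}
```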





[GitHub] spark pull request #17348: [SPARK-20018][SQL] Pivot with timestamp and count...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17348#discussion_r107318998
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -486,14 +486,16 @@ class Analyzer(
   case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, 
child) =>
 val singleAgg = aggregates.size == 1
 def outputName(value: Literal, aggregate: Expression): String = {
+  val utf8val = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
--- End diff --

It seems we can cast into `StringType` in all the ways - 
https://github.com/apache/spark/blob/e9e2c612d58a19ddcb4b6abfb7389a4b0f7ef6f8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L41







[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...

2017-03-21 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/17250#discussion_r107319004
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala
 ---
@@ -83,3 +84,146 @@ private[kinesis] final case class 
STSCredentialsProvider(
 }
   }
 }
+
+@InterfaceStability.Stable
+object SerializableCredentialsProvider {
+  /**
+   * Builder for [[SerializableCredentialsProvider]] instances.
+   *
+   * @since 2.2.0
+   */
+  @InterfaceStability.Stable
+  class Builder {
+private var awsAccessKeyId: Option[String] = None
+private var awsSecretKey: Option[String] = None
+private var stsRoleArn: Option[String] = None
+private var stsSessionName: Option[String] = None
+private var stsExternalId: Option[String] = None
+
+
+/**
+ * Sets the AWS access key ID when using a basic AWS keypair for 
long-lived authorization
+ * credentials. A value must also be provided for the AWS secret key.
+ *
+ * @param accessKeyId AWS access key ID
+ * @return Reference to this 
[[SerializableCredentialsProvider.Builder]]
+ */
+def awsAccessKeyId(accessKeyId: String): Builder = {
+  awsAccessKeyId = Option(accessKeyId)
+  this
+}
+
+/**
+ * Sets the AWS secret key when using a basic AWS keypair for 
long-lived authorization
+ * credentials. A value must also be provided for the AWS access key 
ID.
+ *
+ * @param secretKey AWS secret key
+ * @return Reference to this 
[[SerializableCredentialsProvider.Builder]]
+ */
+def awsSecretKey(secretKey: String): Builder = {
+  awsSecretKey = Option(secretKey)
+  this
+}
+
+/**
+ * Sets the ARN of the IAM role to assume when using AWS STS for 
temporary session-based
+ * authentication. A value must also be provided for the STS session 
name.
+ *
+ * @param roleArn ARN of IAM role to assume via STS
+ * @return Reference to this 
[[SerializableCredentialsProvider.Builder]]
+ */
+def stsRoleArn(roleArn: String): Builder = {
--- End diff --

Will do





[GitHub] spark pull request #17348: [SPARK-20018][SQL] Pivot with timestamp and count...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17348#discussion_r107319664
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -486,14 +486,16 @@ class Analyzer(
   case Pivot(groupByExprs, pivotColumn, pivotValues, aggregates, 
child) =>
 val singleAgg = aggregates.size == 1
 def outputName(value: Literal, aggregate: Expression): String = {
+  val utf8Value = Cast(value, StringType, Some(conf.sessionLocalTimeZone)).eval(EmptyRow)
--- End diff --

BTW, is this a correct way for handling timezone - @ueshin ?
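For context, a sketch of what that expression evaluates to, against the
2.2-era Catalyst API quoted in the diff (the literal value and time zone here
are made up for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal}
import org.apache.spark.sql.types.{StringType, TimestampType}

// 2017-03-21 00:00:00 UTC, stored as microseconds since the epoch
val ts = Literal(1490054400000000L, TimestampType)
val rendered = Cast(ts, StringType, Some("America/Los_Angeles")).eval(EmptyRow)
println(rendered) // UTF8String: 2017-03-20 17:00:00
```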





[GitHub] spark issue #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - recover f...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17382
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75014/
Test PASSed.





[GitHub] spark issue #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - recover f...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17382
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17382: [SPARK-20051][SS] Fix StreamSuite flaky test - recover f...

2017-03-21 Thread kunalkhamar
Github user kunalkhamar commented on the issue:

https://github.com/apache/spark/pull/17382
  
@tdas all tests passed.





[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75015/
Test PASSed.





[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75015 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75015/testReport)**
 for PR 17379 at commit 
[`ad77fe9`](https://github.com/apache/spark/commit/ad77fe9ad258eac224f069bbc89294818ee6b549).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13932: [SPARK-15354] [CORE] Topology aware block replication st...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13932
  
retest this please





[GitHub] spark issue #13932: [SPARK-15354] [CORE] Topology aware block replication st...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13932
  
**[Test build #75020 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75020/testReport)**
 for PR 13932 at commit 
[`ec601bd`](https://github.com/apache/spark/commit/ec601bd1e619a3f6f35597753954b82536de6bc9).





[GitHub] spark issue #17358: [SPARK-20027][DOCS] Compilation fix in java docs.

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17358
  
If my suggestion is what's currently holding up this PR and you're not sure about it, that's fine. I can double-check and deal with it in a separate PR.





[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...

2017-03-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17302#discussion_r107323158
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -41,7 +41,20 @@ object CatalystSerde {
   }
 
   def generateObjAttr[T : Encoder]: Attribute = {
-    AttributeReference("obj", encoderFor[T].deserializer.dataType, nullable = false)()
+    val deserializer = encoderFor[T].deserializer
+    val dataType = deserializer.dataType
+    val nullable = if (deserializer.childrenResolved) {
--- End diff --

I know `resolveAndBind` will resolve it, but look at the code snippet above:

    val deserializer = encoderFor[T].deserializer
    val dataType = deserializer.dataType
    val nullable = if (deserializer.childrenResolved) {  // or deserializer.resolved

I don't see where you would get a chance to call `resolveAndBind` here...

`encoderFor[T]` calls `assertUnresolved` on the encoder it returns, which is implemented like:

    def assertUnresolved(): Unit = {
      (deserializer +: serializer).foreach(_.foreach {
        case a: AttributeReference if a.name != "loopVar" =>
          sys.error(s"Unresolved encoder expected, but $a was found.")
        case _ =>
      })
    }

It makes sure that both `deserializer` and `serializer` are unresolved.
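
A minimal sketch of the guarded nullability decision under debate (an illustrative reduction of the diff; the `else true` fallback is an assumed safe default, not necessarily the PR's final code):

    // Only trust the deserializer's nullability once its children are
    // resolved; otherwise assume nullable = true as the safe default.
    val deserializer = encoderFor[T].deserializer
    val nullable = if (deserializer.childrenResolved) deserializer.nullable else true
    AttributeReference("obj", deserializer.dataType, nullable)()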







[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r107323085
  
--- Diff: core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala ---
@@ -48,12 +51,30 @@ private[spark] object CryptoStreamUtils extends Logging {
       os: OutputStream,
       sparkConf: SparkConf,
       key: Array[Byte]): OutputStream = {
-    val properties = toCryptoConf(sparkConf)
-    val iv = createInitializationVector(properties)
+    val params = new CryptoParams(key, sparkConf)
+    val iv = createInitializationVector(params.conf)
     os.write(iv)
-    val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION)
-    new CryptoOutputStream(transformationStr, properties, os,
-      new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
+    new CryptoOutputStream(params.transformation, params.conf, os, params.keySpec,
+      new IvParameterSpec(iv))
+  }
+
+  /**
+   * Wrap a `WritableByteChannel` for encryption.
+   */
+  def createWritableChannel(
+      channel: WritableByteChannel,
+      sparkConf: SparkConf,
+      key: Array[Byte]): WritableByteChannel = {
+    val params = new CryptoParams(key, sparkConf)
+    val iv = createInitializationVector(params.conf)
+    val buf = ByteBuffer.wrap(iv)
+    while (buf.hasRemaining()) {
--- End diff --

is there any possibility this may be an infinite loop?
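
For comparison, the standard NIO full-write idiom and why termination depends on the channel's blocking mode (a generic sketch, not Spark code):

    import java.nio.ByteBuffer
    import java.nio.channels.WritableByteChannel

    def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
      while (buf.hasRemaining()) {
        // A blocking channel's write() blocks until at least one byte is
        // written, so the loop always makes progress; a non-blocking
        // channel may return 0 and spin here, which is the concern above.
        channel.write(buf)
      }
    }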





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15628
  
**[Test build #75016 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75016/testReport)**
 for PR 15628 at commit 
[`17dec63`](https://github.com/apache/spark/commit/17dec637d33c997f724e06ab28487f1e9441394c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-03-21 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/17014
  
ping @hhbyyh 
I updated the PR; could you please help review it? Thanks in advance.





[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r107323833
  
--- Diff: core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala ---
@@ -167,30 +167,26 @@ private[spark] class SerializerManager(
     val byteStream = new BufferedOutputStream(outputStream)
     val autoPick = !blockId.isInstanceOf[StreamBlockId]
     val ser = getSerializer(implicitly[ClassTag[T]], autoPick).newInstance()
-    ser.serializeStream(wrapStream(blockId, byteStream)).writeAll(values).close()
--- End diff --

the `wrapStream` and `wrapForEncryption` methods can be removed from this 
class





[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17295
  
**[Test build #74991 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74991/testReport)**
 for PR 17295 at commit 
[`107e3e7`](https://github.com/apache/spark/commit/107e3e72e81d2c7813d832d3e9c2beab89e01379).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17166
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17302
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74992/
Test PASSed.





[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17166
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74993/
Test FAILed.





[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17336
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17336
  
**[Test build #74995 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74995/testReport)**
 for PR 17336 at commit 
[`9c046c3`](https://github.com/apache/spark/commit/9c046c3bb8dfd6dd0fa2799d434a4f92cbb1b802).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17295
  
**[Test build #74997 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74997/testReport)**
 for PR 17295 at commit 
[`6bda670`](https://github.com/apache/spark/commit/6bda6701bf0c266047a5fa81fd29f4fb826728c7).

