[GitHub] spark issue #21131: [SPARK-23433][CORE] Late zombie task completions update ...

2018-07-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21131
  
a late LGTM


---




[GitHub] spark pull request #21131: [SPARK-23433][CORE] Late zombie task completions ...

2018-07-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21131#discussion_r203257926
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -764,6 +769,19 @@ private[spark] class TaskSetManager(
     maybeFinishTaskSet()
   }
 
+  private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
+    partitionToIndex.get(partitionId).foreach { index =>
+      if (!successful(index)) {
+        tasksSuccessful += 1
+        successful(index) = true
+        if (tasksSuccessful == numTasks) {
+          isZombie = true
+        }
+        maybeFinishTaskSet()
--- End diff --

is this line needed? We will call `maybeFinishTaskSet()` at the end of 
`handleSuccessfulTask`


---




[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...

2018-07-17 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/spark/pull/21488#discussion_r203256766
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 ---
@@ -115,7 +116,7 @@ private[kafka010] class KafkaOffsetReader(
   def fetchTopicPartitions(): Set[TopicPartition] = runUninterruptibly {
     assert(Thread.currentThread().isInstanceOf[UninterruptibleThread])
     // Poll to get the latest assigned partitions
-    consumer.poll(0)
+    consumer.poll(JDuration.ofMillis(0))
--- End diff --

@zsxwing Why do you want to support Kafka clients jars from 0.10 to 2.0? 
Since newer clients jars support older brokers, we recommend people use the 
latest Kafka clients jar whenever possible.
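
For readers of the diff above, here is a minimal, self-contained sketch of the poll overload it switches to; the broker address, group id, deserializers, and topic name are illustrative assumptions, not taken from the PR:

```
import java.time.{Duration => JDuration}
import java.util.{Arrays, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

// poll(Long) is deprecated in Kafka clients 2.0+; poll(java.time.Duration) is the
// replacement used in the diff above and only exists on the newer clients.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed broker address
props.put("group.id", "kafka-offset-reader-example") // illustrative group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("some-topic")) // made-up topic name
consumer.poll(JDuration.ofMillis(0)) // Duration-based poll, as in the diff
consumer.close()
```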


---




[GitHub] spark issue #21386: [SPARK-23928][SQL][WIP] Add shuffle collection function.

2018-07-17 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21386
  
Okay, I'll take this over, and I'll ping you when I submit a PR to ask for a review. 
Thanks!


---




[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21795
  
**[Test build #93215 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93215/testReport)**
 for PR 21795 at commit 
[`de5a232`](https://github.com/apache/spark/commit/de5a2323b5b46a4c073e3ff1dce6daea395dd1dd).


---




[GitHub] spark issue #21795: [SPARK-24165][SQL][followup] Fixing conditional expressi...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21795
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21795: [SPARK-24165][SQL][followup] Fixing conditional expressi...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21795
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1084/
Test PASSed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93212/
Test PASSed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #93212 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93212/testReport)**
 for PR 21488 at commit 
[`e7318a9`](https://github.com/apache/spark/commit/e7318a9ac7597c0284d5b0732926fce5caec40ad).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93210/
Test PASSed.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21800
  
**[Test build #93210 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93210/testReport)**
 for PR 21800 at commit 
[`d6f41e0`](https://github.com/apache/spark/commit/d6f41e08e3111e6c878afc0845e3f2a091519dd1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-17 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r203252809
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
 
     writeLong(array.length)
     writeLongArray(writeBuffer, array, array.length)
-    val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-    writeLong(used)
+    val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+    writeLong(cursorFlag)
+    val used = (cursorFlag / 8).toInt
--- End diff --

Can you post the image in this PR? The web site you refer to contains too many ads.


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203251175
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBufferFileRegion.scala 
---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.util.io
+
+import java.nio.channels.WritableByteChannel
+
+import io.netty.channel.FileRegion
+import io.netty.util.AbstractReferenceCounted
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.network.util.AbstractFileRegion
+
+
+/**
+ * This exposes a ChunkedByteBuffer as a netty FileRegion, just to allow 
sending > 2gb in one netty
+ * message.  This is because netty cannot send a ByteBuf > 2g, but it can 
send a large FileRegion,
+ * even though the data is not backed by a file.
+ */
+private[io] class ChunkedByteBufferFileRegion(
+private val chunkedByteBuffer: ChunkedByteBuffer,
+private val ioChunkSize: Int) extends AbstractFileRegion {
+
+  private var _transferred: Long = 0
+  // this duplicates the original chunks, so we're free to modify the 
position, limit, etc.
+  private val chunks = chunkedByteBuffer.getChunks()
+  private val size = chunks.foldLeft(0L) { _ + _.remaining() }
+
+  protected def deallocate: Unit = {}
+
+  override def count(): Long = size
+
+  // this is the "start position" of the overall Data in the backing file, 
not our current position
+  override def position(): Long = 0
+
+  override def transferred(): Long = _transferred
+
+  private var currentChunkIdx = 0
+
+  def transferTo(target: WritableByteChannel, position: Long): Long = {
+assert(position == _transferred)
+if (position == size) return 0L
+var keepGoing = true
+var written = 0L
+var currentChunk = chunks(currentChunkIdx)
+while (keepGoing) {
+  while (currentChunk.hasRemaining && keepGoing) {
+val ioSize = Math.min(currentChunk.remaining(), ioChunkSize)
+val originalLimit = currentChunk.limit()
+currentChunk.limit(currentChunk.position() + ioSize)
+val thisWriteSize = target.write(currentChunk)
+currentChunk.limit(originalLimit)
+written += thisWriteSize
+if (thisWriteSize < ioSize) {
--- End diff --

I see, thanks for explaining.


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203250619
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: 
Array[ByteBuffer]) {
 
 }
 
+object ChunkedByteBuffer {
+  // TODO eliminate this method if we switch BlockManager to getting 
InputStreams
+  def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): 
ChunkedByteBuffer = {
+data match {
+  case f: FileSegmentManagedBuffer =>
+map(f.getFile, maxChunkSize, f.getOffset, f.getLength)
+  case other =>
+new ChunkedByteBuffer(other.nioByteBuffer())
+}
+  }
+
+  def map(file: File, maxChunkSize: Int, offset: Long, length: Long): 
ChunkedByteBuffer = {
+Utils.tryWithResource(new FileInputStream(file).getChannel()) { 
channel =>
--- End diff --

I've already updated some of them in SPARK-21475 in the shuffle-related code path, but not all of them; the remaining ones are not so critical.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Build finished. Test FAILed.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93211/
Test FAILed.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
**[Test build #93211 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93211/testReport)**
 for PR 21669 at commit 
[`13b3adc`](https://github.com/apache/spark/commit/13b3adc5ffb55fbfd6572089b1f54e8bca393494).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93209/
Test PASSed.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix l...

2018-07-17 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21801#discussion_r203249250
  
--- Diff: external/avro/src/test/resources/log4j.properties ---
@@ -46,4 +46,4 @@ 
log4j.additivity.org.apache.hadoop.hive.metastore.RetryingHMSHandler=false
 log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=OFF
 
 log4j.additivity.hive.ql.metadata.Hive=false
-log4j.logger.hive.ql.metadata.Hive=OFF
\ No newline at end of file
+log4j.logger.hive.ql.metadata.Hive=OFF
--- End diff --

Me neither, actually.


---




[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21800
  
**[Test build #93209 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93209/testReport)**
 for PR 21800 at commit 
[`6a89c65`](https://github.com/apache/spark/commit/6a89c658c6a56c34d5e0578f09db77b4ed9b41f3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20146
  
Ah, I know why: the latest commit 
https://github.com/apache/spark/pull/20146/commits/a6551b02a10428d66e0dadcfcb5a8da3798ec814
doesn't contain the changes that trigger the test. Rebasing and squashing it will trigger the test.


---




[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21542
  
Hm, this happened to me too and I just pushed 
https://github.com/apache/spark/commit/fc2e18963efdf4b50258f85c8779122742876910.
Mine was Java 8. There's a reproducer in the PR description.


---




[GitHub] spark pull request #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix l...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21801#discussion_r203248740
  
--- Diff: external/avro/src/test/resources/log4j.properties ---
@@ -46,4 +46,4 @@ 
log4j.additivity.org.apache.hadoop.hive.metastore.RetryingHMSHandler=false
 log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=OFF
 
 log4j.additivity.hive.ql.metadata.Hive=false
-log4j.logger.hive.ql.metadata.Hive=OFF
\ No newline at end of file
+log4j.logger.hive.ql.metadata.Hive=OFF
--- End diff --

Hm, I didn't know Maven's checkstyle checks resources too.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93213/
Test FAILed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #93213 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93213/testReport)**
 for PR 21488 at commit 
[`13a7884`](https://github.com/apache/spark/commit/13a7884279103ed06778e4351616e316beb7566f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21801
  
**[Test build #93214 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93214/testReport)**
 for PR 21801 at commit 
[`7f78d75`](https://github.com/apache/spark/commit/7f78d750411a4098527b2b332495f5dd4f20c63e).


---




[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21801
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1083/
Test PASSed.


---




[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21801
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix l...

2018-07-17 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/21801

[SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-java and Scala 2.12 
build.

## What changes were proposed in this pull request?

This pr fixes lint-java and Scala 2.12 build.

lint-java:

```
[ERROR] src/test/resources/log4j.properties:[0] (misc) NewlineAtEndOfFile: 
File does not end with a newline.
```

Scala 2.12 build:

```
[error] 
/.../sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousCoalesceRDD.scala:121:
 overloaded method value addTaskCompletionListener with alternatives:
[error]   (f: org.apache.spark.TaskContext => 
Unit)org.apache.spark.TaskContext 
[error]   (listener: 
org.apache.spark.util.TaskCompletionListener)org.apache.spark.TaskContext
[error]  cannot be applied to (org.apache.spark.TaskContext => 
java.util.List[Runnable])
[error]   context.addTaskCompletionListener { ctx =>
[error]   ^
```
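
For context, here is a small self-contained sketch of the kind of overload ambiguity Scala 2.12's SAM conversion causes; the names below are illustrative, not Spark's API, and the explicit-listener workaround is just one common way to disambiguate, not necessarily what this patch does:

```
// Two overloads: one takes a plain function, the other a SAM (single abstract method) type.
trait Listener { def onComplete(ctx: String): Unit }

object Ctx {
  def addListener(f: String => Unit): Unit = f("done")
  def addListener(l: Listener): Unit = l.onComplete("done")
}

object Demo extends App {
  // Under Scala 2.12, a block whose last expression is not Unit is inferred as
  // String => List[String], which matches neither overload, so this fails to compile:
  // Ctx.addListener { s => List(s) }

  // Passing an explicit listener (or ending the closure with a Unit value) is unambiguous:
  Ctx.addListener(new Listener {
    override def onComplete(ctx: String): Unit = println(ctx)
  })
}
```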


## How was this patch tested?

Manually executed lint-java and Scala 2.12 build in my local environment.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark 
issues/SPARK-24386_24768/fix_build

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21801.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21801


commit 2fed7d487d90da5ea65d35d50208f88b5f829145
Author: Takuya UESHIN 
Date:   2018-07-17T09:51:01Z

Fix lint-java.

commit 7f78d750411a4098527b2b332495f5dd4f20c63e
Author: Takuya UESHIN 
Date:   2018-07-18T03:45:34Z

Fix Scala 2.12 build.




---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r20324
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBufferFileRegion.scala 
---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.util.io
+
+import java.nio.channels.WritableByteChannel
+
+import io.netty.channel.FileRegion
+import io.netty.util.AbstractReferenceCounted
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.network.util.AbstractFileRegion
+
+
+/**
+ * This exposes a ChunkedByteBuffer as a netty FileRegion, just to allow 
sending > 2gb in one netty
+ * message.  This is because netty cannot send a ByteBuf > 2g, but it can 
send a large FileRegion,
+ * even though the data is not backed by a file.
+ */
+private[io] class ChunkedByteBufferFileRegion(
+private val chunkedByteBuffer: ChunkedByteBuffer,
+private val ioChunkSize: Int) extends AbstractFileRegion {
+
+  private var _transferred: Long = 0
+  // this duplicates the original chunks, so we're free to modify the 
position, limit, etc.
+  private val chunks = chunkedByteBuffer.getChunks()
+  private val size = chunks.foldLeft(0L) { _ + _.remaining() }
+
+  protected def deallocate: Unit = {}
+
+  override def count(): Long = size
+
+  // this is the "start position" of the overall Data in the backing file, 
not our current position
+  override def position(): Long = 0
+
+  override def transferred(): Long = _transferred
+
+  private var currentChunkIdx = 0
+
+  def transferTo(target: WritableByteChannel, position: Long): Long = {
+assert(position == _transferred)
+if (position == size) return 0L
+var keepGoing = true
+var written = 0L
+var currentChunk = chunks(currentChunkIdx)
+while (keepGoing) {
+  while (currentChunk.hasRemaining && keepGoing) {
+val ioSize = Math.min(currentChunk.remaining(), ioChunkSize)
+val originalLimit = currentChunk.limit()
+currentChunk.limit(currentChunk.position() + ioSize)
+val thisWriteSize = target.write(currentChunk)
+currentChunk.limit(originalLimit)
+written += thisWriteSize
+if (thisWriteSize < ioSize) {
--- End diff --

Actually this is a totally normal condition; it just means the channel is not currently ready to accept any more data. This is something netty expects, and it will make sure the rest of the data is put on the channel eventually (it'll get called the next time with the correct `position` argument indicating how far along it is).

The added unit tests cover this.
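
To make the retry behavior concrete, here is a toy, self-contained sketch (not Spark's code) of a channel that accepts only a few bytes per call and a caller that keeps retrying, which is essentially what netty does with `transferTo`:

```
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Channel that accepts at most `cap` bytes per write, mimicking a socket that is
// temporarily unable to take more data (the thisWriteSize < ioSize case above).
class LimitedChannel(cap: Int) extends WritableByteChannel {
  override def write(src: ByteBuffer): Int = {
    val n = math.min(cap, src.remaining())
    src.position(src.position() + n) // pretend n bytes were consumed
    n
  }
  override def isOpen: Boolean = true
  override def close(): Unit = ()
}

object PartialWriteDemo extends App {
  val data = ByteBuffer.wrap(Array.fill[Byte](100)(1))
  val channel = new LimitedChannel(7)
  var transferred = 0L
  // Netty-style loop: each call may write less than requested; the caller simply
  // comes back later, knowing how far along it already is.
  while (data.hasRemaining) {
    transferred += channel.write(data)
  }
  println(s"transferred $transferred bytes across multiple partial writes")
}
```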


---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-07-17 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20146
  
It's odd that AppVeyor tests are not getting triggered.


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203245221
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: 
Array[ByteBuffer]) {
 
 }
 
+object ChunkedByteBuffer {
+  // TODO eliminate this method if we switch BlockManager to getting 
InputStreams
+  def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): 
ChunkedByteBuffer = {
+data match {
+  case f: FileSegmentManagedBuffer =>
+map(f.getFile, maxChunkSize, f.getOffset, f.getLength)
+  case other =>
+new ChunkedByteBuffer(other.nioByteBuffer())
+}
+  }
+
+  def map(file: File, maxChunkSize: Int, offset: Long, length: Long): 
ChunkedByteBuffer = {
+Utils.tryWithResource(new FileInputStream(file).getChannel()) { 
channel =>
--- End diff --

I wasn't aware of that issue; thanks for sharing, I'll update this. Should we also update the other uses? It seems there are a lot of other cases, e.g. `UnsafeShuffleWriter`, `DiskBlockObjectWriter`, etc.


---




[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan

2018-07-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21710#discussion_r203245118
  
--- Diff: R/pkg/R/mllib_fpm.R ---
@@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = 
"FPGrowthModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PrefixSpan
+#'
+#' A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+#' \code{spark.prefixSpan} returns an instance of PrefixSpan.
+#' \code{spark.findFrequentSequentialPatterns} returns a complete set of 
frequent sequential
+#' patterns.
+#' For more details, see
+#' 
\href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{
+#' PrefixSpan}.
+#'
+#' @param minSupport Minimal support level.
+#' @param maxPatternLength Maximal pattern length.
+#' @param maxLocalProjDBSize Maximum number of items (including delimiters 
used in the internal
+#'   storage format) allowed in a projected 
database before local
+#'   processing.
+#' @param sequenceCol name of the sequence column in dataset.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan
+#' @rdname spark.prefixSpan
+#' @name spark.prefixSpan
+#' @aliases spark.prefixSpan,ANY-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+#'   list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+#'   list(list(list(1L, 2L), list(5L))),
+#'   list(list(list(6L, schema = c("sequence"))
+#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L,
+#' maxLocalProjDBSize = 3200L)
+#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df)
+#' showDF(frequency)
+#' }
+#' @note spark.prefixSpan since 2.4.0
+setMethod("spark.prefixSpan", signature(),
+  function(minSupport=0.1, maxPatternLength=10L,
+   maxLocalProjDBSize=3200L, sequenceCol="sequence") {
+if (!is.numeric(minSupport) || minSupport < 0) {
+  stop("minSupport should be a number with value >= 0.")
+}
+if (!is.integer(maxPatternLength) || maxPatternLength <= 0) {
+  stop("maxPatternLength should be a number with value > 0.")
+}
+if (!is.numeric(maxLocalProjDBSize) || maxLocalProjDBSize <= 
0) {
+  stop("maxLocalProjDBSize should be a number with value > 0.")
+}
+
+jobj <- callJStatic("org.apache.spark.ml.r.PrefixSpanWrapper", 
"getPrefixSpan",
+as.numeric(minSupport), 
as.integer(maxPatternLength),
+as.numeric(maxLocalProjDBSize), 
as.character(sequenceCol))
+new("PrefixSpan", jobj = jobj)
+  })
+
+# Find frequent sequential patterns.
+
+#' @param object a prefixSpan object.
+#' @param data A SparkDataFrame.
+#' @return A complete set of frequent sequential patterns in the input 
sequences of itemsets.
+#' The returned \code{SparkDataFrame} contains columns of sequence 
and corresponding
+#' frequency. The schema of it will be:
+#' \code{sequence: ArrayType(ArrayType(T))} (T is the item type)
+#' \code{freq: Long}
+#' @rdname spark.prefixSpan
--- End diff --

ditto here


---




[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan

2018-07-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21710#discussion_r203245010
  
--- Diff: R/pkg/R/generics.R ---
@@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { 
standardGeneric("spark.freqI
 #' @rdname spark.fpGrowth
 setGeneric("spark.associationRules", function(object) { 
standardGeneric("spark.associationRules") })
 
+#' @rdname spark.prefixSpan
+setGeneric("spark.prefixSpan", function(...) { 
standardGeneric("spark.prefixSpan") })
+
+#' @rdname spark.prefixSpan
--- End diff --

I think he meant the rdname; yes, it's OK to have both in one Rd.


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203244832
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: 
Array[ByteBuffer]) {
 
 }
 
+object ChunkedByteBuffer {
+  // TODO eliminate this method if we switch BlockManager to getting 
InputStreams
+  def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): 
ChunkedByteBuffer = {
+data match {
+  case f: FileSegmentManagedBuffer =>
+map(f.getFile, maxChunkSize, f.getOffset, f.getLength)
+  case other =>
+new ChunkedByteBuffer(other.nioByteBuffer())
+}
+  }
+
+  def map(file: File, maxChunkSize: Int, offset: Long, length: Long): 
ChunkedByteBuffer = {
+Utils.tryWithResource(new FileInputStream(file).getChannel()) { 
channel =>
+  var remaining = length
+  var pos = offset
+  val chunks = new ListBuffer[ByteBuffer]()
+  while (remaining > 0) {
+val chunkSize = math.min(remaining, maxChunkSize)
+val chunk = channel.map(FileChannel.MapMode.READ_ONLY, pos, 
chunkSize)
--- End diff --

I think your concern is that when we are going to send data that is backed 
by a file, eg. a remote read of an RDD cached on disk, we should be able to 
send it using something more efficient than memory mapping the entire file.  Is 
that correct?

That actually isn't a problem.  This `map()` method isn't called for 
sending disk-cached RDDs.  That is already handled correctly with 
`FileSegmentManagedBuffer.convertToNetty()`, which uses the `DefaultFileRegion` 
you had in mind.  The `map` method is only used on the receiving end, after the 
data has already been transferred, and just to pass the data on to other spark 
code locally in the executor.  (And that will avoid the `map()` entirely after 
the TODO above.)

I needed to add `ChunkedByteBufferFileRegion` for data that is already in 
memory as a ChunkedByteBuffer, eg. for memory-cached RDDs. 


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93208/
Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #93208 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93208/testReport)**
 for PR 21305 at commit 
[`f041019`](https://github.com/apache/spark/commit/f041019b1ffe8187e47ba89de96daf631dfd56da).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21542: [SPARK-24529][Build][test-maven] Add spotbugs into maven...

2018-07-17 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21542
  
"permission" stuff might be Java 9 related?


---




[GitHub] spark issue #21787: [SPARK-24568] Code refactoring for DataType equalsXXX me...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21787
  
Constant variables and squashing the logic into one function don't look worthwhile and seem like overkill. Less duplication is good, of course, but it doesn't look worth it in either case. I would focus on more important stuff. -1 from me.


---




[GitHub] spark issue #21787: [SPARK-24568] Code refactoring for DataType equalsXXX me...

2018-07-17 Thread swapnilushinde
Github user swapnilushinde commented on the issue:

https://github.com/apache/spark/pull/21787
  
True. Currently we have just 3 variations of comparing two datatypes for equality. Adding even one more equality function would easily lead to writing the same repetitive code, which would negate the observed increase in code lines. Comparing two datasets with different criteria for datatype equality is very handy for unit-testing purposes too.
Can you please let me know what you find unreadable/complex in the new code so I can try to simplify it if possible?


---




[GitHub] spark pull request #21795: [SPARK-24165][SQL][followup] Fixing conditional e...

2018-07-17 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21795#discussion_r203242468
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2336,46 +2336,40 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 
 val sourceDF = spark.createDataFrame(rows, schema)
 
-val structWhenDF = sourceDF
+def structWhenDF: DataFrame = sourceDF
   .select(when('cond, struct(lit("a").as("val1"), 
lit(10).as("val2"))).otherwise('s) as "res")
   .select('res.getField("val1"))
-val arrayWhenDF = sourceDF
+def arrayWhenDF: DataFrame = sourceDF
   .select(when('cond, array(lit("a"), lit("b"))).otherwise('a) as 
"res")
   .select('res.getItem(0))
-val mapWhenDF = sourceDF
+def mapWhenDF: DataFrame = sourceDF
   .select(when('cond, map(lit(0), lit("a"))).otherwise('m) as "res")
   .select('res.getItem(0))
 
-val structIfDF = sourceDF
+def structIfDF: DataFrame = sourceDF
   .select(expr("if(cond, struct('a' as val1, 10 as val2), s)") as 
"res")
   .select('res.getField("val1"))
-val arrayIfDF = sourceDF
+def arrayIfDF: DataFrame = sourceDF
   .select(expr("if(cond, array('a', 'b'), a)") as "res")
   .select('res.getItem(0))
-val mapIfDF = sourceDF
+def mapIfDF: DataFrame = sourceDF
   .select(expr("if(cond, map(0, 'a'), m)") as "res")
   .select('res.getItem(0))
 
-def checkResult(df: DataFrame, codegenExpected: Boolean): Unit = {
-  
assert(df.queryExecution.executedPlan.isInstanceOf[WholeStageCodegenExec] == 
codegenExpected)
-  checkAnswer(df, Seq(Row("a"), Row(null)))
+def checkResult(): Unit = {
+  checkAnswer(structWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(arrayWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(mapWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(structIfDF, Seq(Row("a"), Row(null)))
+  checkAnswer(arrayIfDF, Seq(Row("a"), Row(null)))
+  checkAnswer(mapIfDF, Seq(Row("a"), Row(null)))
 }
 
-// without codegen
-checkResult(structWhenDF, false)
-checkResult(arrayWhenDF, false)
-checkResult(mapWhenDF, false)
-checkResult(structIfDF, false)
-checkResult(arrayIfDF, false)
-checkResult(mapIfDF, false)
-
-// with codegen
-checkResult(structWhenDF.filter('cond.isNotNull), true)
--- End diff --

I saw some tests using similar dummy filters in `DataFrameFunctionsSuite`. 
Should we fix them as well?


---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-17 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
@hvanhovell Thanks for reviewing. Data is lost because the variable **cursor** in the executor is 0 and serialization depends on it. I will add a UT later.


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-17 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r203241485
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
 
     writeLong(array.length)
     writeLongArray(writeBuffer, array, array.length)
-    val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-    writeLong(used)
+    val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+    writeLong(cursorFlag)
+    val used = (cursorFlag / 8).toInt
--- End diff --

Data is lost when serializing LongHashedRelation in the executor; can you see [this picture](http://oi67.tinypic.com/2z5pzs7.jpg)? In the executor, the cursor is 0.
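
To see the failure mode in isolation, here is a toy sketch (deliberately not Spark's `LongToUnsafeRowMap`, just the shape of the bug): if the amount of valid data is derived from a transient `cursor` that is 0 on a deserialized copy, re-serializing writes out an empty page even though the data is still there.

```
// Toy stand-in: `cursor` tracks how far the page has been written locally, but it is
// not restored on deserialization, so a broadcast copy in an executor sees 0.
class ToyMap(val page: Array[Long], var cursor: Int) {
  def serializedPage: Array[Long] = {
    val used = cursor // wrong on the executor side, where cursor == 0
    page.take(used)   // drops everything instead of the real contents
  }
}

val driverSide = new ToyMap(Array(1L, 2L, 3L), cursor = 3)
val executorSide = new ToyMap(driverSide.page, cursor = 0) // what deserialization yields
println(executorSide.serializedPage.length) // 0: the data is lost on re-serialization
```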


---




[GitHub] spark pull request #18138: [SPARK-20915][SQL] Make lpad/rpad with empty pad ...

2018-07-17 Thread wangyum
Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/18138


---




[GitHub] spark issue #19804: [WIP][SPARK-22573][SQL] Shouldn't inferFilters if it con...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19804
  
Thank you @wangyum.


---




[GitHub] spark pull request #21795: [SPARK-24165][SQL][followup] Fixing conditional e...

2018-07-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21795#discussion_r203240428
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2336,46 +2336,40 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 
 val sourceDF = spark.createDataFrame(rows, schema)
 
-val structWhenDF = sourceDF
+def structWhenDF: DataFrame = sourceDF
   .select(when('cond, struct(lit("a").as("val1"), 
lit(10).as("val2"))).otherwise('s) as "res")
   .select('res.getField("val1"))
-val arrayWhenDF = sourceDF
+def arrayWhenDF: DataFrame = sourceDF
   .select(when('cond, array(lit("a"), lit("b"))).otherwise('a) as 
"res")
   .select('res.getItem(0))
-val mapWhenDF = sourceDF
+def mapWhenDF: DataFrame = sourceDF
   .select(when('cond, map(lit(0), lit("a"))).otherwise('m) as "res")
   .select('res.getItem(0))
 
-val structIfDF = sourceDF
+def structIfDF: DataFrame = sourceDF
   .select(expr("if(cond, struct('a' as val1, 10 as val2), s)") as 
"res")
   .select('res.getField("val1"))
-val arrayIfDF = sourceDF
+def arrayIfDF: DataFrame = sourceDF
   .select(expr("if(cond, array('a', 'b'), a)") as "res")
   .select('res.getItem(0))
-val mapIfDF = sourceDF
+def mapIfDF: DataFrame = sourceDF
   .select(expr("if(cond, map(0, 'a'), m)") as "res")
   .select('res.getItem(0))
 
-def checkResult(df: DataFrame, codegenExpected: Boolean): Unit = {
-  
assert(df.queryExecution.executedPlan.isInstanceOf[WholeStageCodegenExec] == 
codegenExpected)
-  checkAnswer(df, Seq(Row("a"), Row(null)))
+def checkResult(): Unit = {
+  checkAnswer(structWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(arrayWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(mapWhenDF, Seq(Row("a"), Row(null)))
+  checkAnswer(structIfDF, Seq(Row("a"), Row(null)))
+  checkAnswer(arrayIfDF, Seq(Row("a"), Row(null)))
+  checkAnswer(mapIfDF, Seq(Row("a"), Row(null)))
 }
 
-// without codegen
-checkResult(structWhenDF, false)
-checkResult(arrayWhenDF, false)
-checkResult(mapWhenDF, false)
-checkResult(structIfDF, false)
-checkResult(arrayIfDF, false)
-checkResult(mapIfDF, false)
-
-// with codegen
-checkResult(structWhenDF.filter('cond.isNotNull), true)
--- End diff --

Ah, that's tricky. Because filter pushdown runs first, the local relation optimization can't be applied.

To prevent confusion like this, how about we use a local/cached relation to test it?


---




[GitHub] spark issue #21784: [SPARK-24182][YARN][FOLLOW-UP] Turn off noisy log output

2018-07-17 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/21784
  
It's noisy when typing something:

![spark-24128](https://user-images.githubusercontent.com/5399861/42857022-8afee254-8a7a-11e8-8ee0-47a19af99fa7.gif)



---




[GitHub] spark issue #21792: [SPARK-23231][ML][DOC] Add doc for string indexer orderi...

2018-07-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/21792
  
@srowen I think we need to update the docs:
1. The current doc in `StringIndexer` is somewhat misleading: "The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`." This is true only with the default ordering type.
2. In RFormula, `stringOrderType` only affects feature columns, not the label column. This needs to be emphasised, since it is somewhat unexpected.

@MLnick your thoughts?
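
For reference, a minimal sketch of the ordering knob being discussed, assuming the standard ML API (the column names are made up):

```
import org.apache.spark.ml.feature.StringIndexer

// With the default "frequencyDesc" the most frequent label gets index 0; any other
// ordering type changes that, which is what the StringIndexer doc should call out.
val indexer = new StringIndexer()
  .setInputCol("category")           // assumed input column
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetAsc") // alphabetical instead of frequency order
```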


---




[GitHub] spark issue #21765: [MINOR][CORE] Add test cases for RDD.cartesian

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21765
  
**[Test build #4218 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4218/testReport)**
 for PR 21765 at commit 
[`e5f469a`](https://github.com/apache/spark/commit/e5f469a0b83d35b8735eeba30dfca4fe0320810b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21789: [SPARK-24829][SQL]CAST AS FLOAT inconsistent with Hive

2018-07-17 Thread zuotingbing
Github user zuotingbing commented on the issue:

https://github.com/apache/spark/pull/21789
  
Yes, only in STS; I will update the title.


---




[GitHub] spark issue #19804: [WIP][SPARK-22573][SQL] Shouldn't inferFilters if it con...

2018-07-17 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19804
  
Thanks @HyukjinKwon, I have already closed some.


---




[GitHub] spark pull request #21789: [SPARK-24829][SQL]CAST AS FLOAT inconsistent with...

2018-07-17 Thread zuotingbing
Github user zuotingbing commented on a diff in the pull request:

https://github.com/apache/spark/pull/21789#discussion_r203239370
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -766,6 +774,14 @@ class HiveThriftHttpServerSuite extends 
HiveThriftJdbcTest {
   assert(resultSet.getString(2) === HiveUtils.builtinHiveVersion)
 }
   }
+
+  test("Checks cast as float") {
--- End diff --

For two different modes: binary & http.


---




[GitHub] spark issue #21754: [SPARK-24705][SQL] Cannot reuse an exchange operator wit...

2018-07-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21754
  
ping


---




[GitHub] spark issue #15945: [SPARK-12978][SQL] Merge unnecessary partial aggregates

2018-07-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/15945
  
I'll close for now.


---




[GitHub] spark pull request #15945: [SPARK-12978][SQL] Merge unnecessary partial aggr...

2018-07-17 Thread maropu
Github user maropu closed the pull request at:

https://github.com/apache/spark/pull/15945


---




[GitHub] spark issue #16605: [SPARK-18884][SQL] Throw an exception in compile time if...

2018-07-17 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16605
  
I'll close for now


---




[GitHub] spark pull request #16605: [SPARK-18884][SQL] Throw an exception in compile ...

2018-07-17 Thread maropu
Github user maropu closed the pull request at:

https://github.com/apache/spark/pull/16605


---




[GitHub] spark pull request #21460: [SPARK-23442][SQL] Increase reading tasks when re...

2018-07-17 Thread wangyum
Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/21460


---




[GitHub] spark pull request #19804: [WIP][SPARK-22573][SQL] Shouldn't inferFilters if...

2018-07-17 Thread wangyum
Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/19804


---




[GitHub] spark pull request #20248: [SPARK-23058][SQL] Show non printable field delim...

2018-07-17 Thread wangyum
Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/20248


---




[GitHub] spark issue #19804: [WIP][SPARK-22573][SQL] Shouldn't inferFilters if it con...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19804
  
@wangyum, I think you could close some of the PRs that you are currently not working on and have no explicit plan to start working on soon. You could reopen them or create a new one when you start to work on them. See also 
http://apache-spark-developers-list.1001551.n3.nabble.com/Stale-PR-update-and-review-request-td24430.html


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203237484
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBufferFileRegion.scala 
---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.util.io
+
+import java.nio.channels.WritableByteChannel
+
+import io.netty.channel.FileRegion
+import io.netty.util.AbstractReferenceCounted
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.network.util.AbstractFileRegion
+
+
+/**
+ * This exposes a ChunkedByteBuffer as a netty FileRegion, just to allow 
sending > 2gb in one netty
+ * message.  This is because netty cannot send a ByteBuf > 2g, but it can 
send a large FileRegion,
+ * even though the data is not backed by a file.
+ */
+private[io] class ChunkedByteBufferFileRegion(
+private val chunkedByteBuffer: ChunkedByteBuffer,
+private val ioChunkSize: Int) extends AbstractFileRegion {
+
+  private var _transferred: Long = 0
+  // this duplicates the original chunks, so we're free to modify the 
position, limit, etc.
+  private val chunks = chunkedByteBuffer.getChunks()
+  private val size = chunks.foldLeft(0L) { _ + _.remaining() }
+
+  protected def deallocate: Unit = {}
+
+  override def count(): Long = size
+
+  // this is the "start position" of the overall Data in the backing file, 
not our current position
+  override def position(): Long = 0
+
+  override def transferred(): Long = _transferred
+
+  private var currentChunkIdx = 0
+
+  def transferTo(target: WritableByteChannel, position: Long): Long = {
+assert(position == _transferred)
+if (position == size) return 0L
+var keepGoing = true
+var written = 0L
+var currentChunk = chunks(currentChunkIdx)
+while (keepGoing) {
+  while (currentChunk.hasRemaining && keepGoing) {
+val ioSize = Math.min(currentChunk.remaining(), ioChunkSize)
+val originalLimit = currentChunk.limit()
+currentChunk.limit(currentChunk.position() + ioSize)
+val thisWriteSize = target.write(currentChunk)
+currentChunk.limit(originalLimit)
+written += thisWriteSize
+if (thisWriteSize < ioSize) {
--- End diff --

What will happen if `thisWriteSize` is smaller than `ioSize`? Will Spark throw an exception or something else?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-07-17 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
cc @gatorsmile changes in 
`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala`
 has been reverted.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21192
  
To me, yes. If you find some time, I would appreciate it if you took a look 
at https://github.com/apache/spark/pull/21192#issuecomment-391405633 too, so 
that we can review each other's work, whoever makes a PR first.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203236014
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: 
Array[ByteBuffer]) {
 
 }
 
+object ChunkedByteBuffer {
+  // TODO eliminate this method if we switch BlockManager to getting 
InputStreams
+  def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): 
ChunkedByteBuffer = {
+data match {
+  case f: FileSegmentManagedBuffer =>
+map(f.getFile, maxChunkSize, f.getOffset, f.getLength)
+  case other =>
+new ChunkedByteBuffer(other.nioByteBuffer())
+}
+  }
+
+  def map(file: File, maxChunkSize: Int, offset: Long, length: Long): 
ChunkedByteBuffer = {
+Utils.tryWithResource(new FileInputStream(file).getChannel()) { 
channel =>
--- End diff --

Can we please use `FileChannel#open` instead? 
FileInputStream/FileOutputStream have some issues 
(https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful)
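
For reference, a minimal sketch of what the suggested replacement could look 
like (the `withReadChannel` helper is hypothetical, not the PR's code):

```scala
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Hypothetical helper: FileChannel.open reports failures as plain IOExceptions
// and does not register a finalizer, unlike new FileInputStream(file).getChannel().
def withReadChannel[T](file: File)(body: FileChannel => T): T = {
  val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
  try body(channel) finally channel.close()
}
```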


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21589: [SPARK-24591][CORE] Number of cores and executors...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21589#discussion_r203235747
  
--- Diff: R/pkg/R/context.R ---
@@ -437,3 +437,33 @@ setCheckpointDir <- function(directory) {
   sc <- getSparkContext()
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(directory))))
 }
+
+#' Total number of CPU cores of all executors registered in the cluster at 
the moment
+#' and potentially available to jobs submitted via the Spark context.
+#'
--- End diff --

and `This method is experimental, and its behavior can be changed in the 
next releases.` too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21589: [SPARK-24591][CORE] Number of cores and executors...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21589#discussion_r203235671
  
--- Diff: R/pkg/R/context.R ---
@@ -437,3 +437,33 @@ setCheckpointDir <- function(directory) {
   sc <- getSparkContext()
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(directory))))
 }
+
+#' Total number of CPU cores of all executors registered in the cluster at 
the moment
+#' and potentially available to jobs submitted via the Spark context.
+#'
--- End diff --

@MaxGekk, is `The number reflects current status of the cluster and can 
change in the future` intentionally taken out here or a mistake?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203235292
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -17,17 +17,21 @@
 
 package org.apache.spark.util.io
 
-import java.io.InputStream
+import java.io.{File, FileInputStream, InputStream}
 import java.nio.ByteBuffer
-import java.nio.channels.WritableByteChannel
+import java.nio.channels.{FileChannel, WritableByteChannel}
+
+import scala.collection.mutable.ListBuffer
 
 import com.google.common.primitives.UnsignedBytes
-import io.netty.buffer.{ByteBuf, Unpooled}
 
 import org.apache.spark.SparkEnv
 import org.apache.spark.internal.config
+import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, 
ManagedBuffer}
 import org.apache.spark.network.util.ByteArrayWritableChannel
 import org.apache.spark.storage.StorageUtils
+import org.apache.spark.util.Utils
+
--- End diff --

nit: this blank line seems unnecessary.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/21800
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21784: [SPARK-24182][YARN][FOLLOW-UP] Turn off noisy log output

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21784
  
Hm, yeah. I don't find this super noisy though, to be honest.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19819: [SPARK-22606][Streaming]Add threadId to the Cache...

2018-07-17 Thread eatoncys
Github user eatoncys closed the pull request at:

https://github.com/apache/spark/pull/19819


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21785: [SPARK-24529][BUILD][test-maven][FOLLOW-UP] Set s...

2018-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21785


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21785: [SPARK-24529][BUILD][test-maven][FOLLOW-UP] Set spotbugs...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21785
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21785: [SPARK-24529][BUILD][test-maven][FOLLOW-UP] Set spotbugs...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21785
  
@kiszk, let me just push this in since it fixes an actual issue and the 
build passes; however, please make a followup if you see something else to fix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21635: [SPARK-24594][YARN] Introducing metrics for YARN

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21635#discussion_r203232034
  
--- Diff: docs/monitoring.md ---
@@ -435,6 +435,7 @@ set of sinks to which metrics are reported. The 
following instances are currentl
 * `executor`: A Spark executor.
 * `driver`: The Spark driver process (the process in which your 
SparkContext is created).
 * `shuffleService`: The Spark shuffle service.
+* `applicationMaster`: The Spark application master on YARN.
 
--- End diff --

I think it would be better to clarify as "The Spark ApplicationMaster when 
running on YARN."


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...

2018-07-17 Thread lvdongr
Github user lvdongr commented on the issue:

https://github.com/apache/spark/pull/19819
  
I've seen your PR: https://github.com/apache/spark/pull/20997. A good 
solution, @gaborgsomogyi!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan

2018-07-17 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21710#discussion_r203229835
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
--- End diff --

You are right. I will remove 
```
private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame = {
    prefixSpan.findFrequentSequentialPatterns(dataset)
  }
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan

2018-07-17 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21710#discussion_r203229794
  
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -82,4 +82,26 @@ test_that("spark.fpGrowth", {
 
 })
 
+test_that("spark.prefixSpan", {
+df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+  list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+  list(list(list(1L, 2L), list(5L))),
+  list(list(list(6L)))), schema = c("sequence"))
+prefix_Span1 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 
5L,
+ maxLocalProjDBSize = 3200L)
+result1 <- spark.findFrequentSequentialPatterns(prefix_Span1, df)
+
+expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
+list(list(list(3L)), 2L),
+list(list(list(2L)), 3L),
+list(list(list(1L, 2L)), 3L),
+list(list(list(1L), list(3L)), 
2L)),
+schema = c("sequence", "freq"))
+expect_equivalent(expected_result, result1)
+
+prefix_Span2 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 
5L)
+result2 <- spark.findFrequentSequentialPatterns(prefix_Span2, df)
+expect_equivalent(expected_result, result2)
--- End diff --

Just trying to test that the default value (```maxLocalProjDBSize = 
3200L```) will be used if the parameter isn't set explicitly. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan

2018-07-17 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21710#discussion_r203229733
  
--- Diff: R/pkg/R/generics.R ---
@@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { 
standardGeneric("spark.freqI
 #' @rdname spark.fpGrowth
 setGeneric("spark.associationRules", function(object) { 
standardGeneric("spark.associationRules") })
 
+#' @rdname spark.prefixSpan
+setGeneric("spark.prefixSpan", function(...) { 
standardGeneric("spark.prefixSpan") })
+
+#' @rdname spark.prefixSpan
--- End diff --

@viirya Thanks for your review. 
Do you mean to use ```findFrequentSequentialPatterns``` instead of 
```spark.findFrequentSequentialPatterns```? I have a question about when we 
need to add ```spark``` in front of the method name. In FPGrowth, some methods 
have ```spark.```, e.g.  ```spark.freqItemsets```, 
```spark.associationRules```. Some methods don't have ```spark.```, e.g. 
```predict``` and ```write.ml``` . 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203229065
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -895,6 +895,8 @@ def csv(self, path, mode=None, compression=None, 
sep=None, quote=None, escape=No
   the quote character. If None is 
set, the default value is
   escape character when escape and 
quote characters are
   different, ``\0`` otherwise..
+:param encoding: sets the encoding (charset) to be used on the csv 
file. If None is set, it
+  uses the default value, 
``UTF-8``.
--- End diff --

Likewise, let's match the doc to JSON's.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228930
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
--- End diff --

`toDF("_c0")` -> `toDF()`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228844
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
--- End diff --

`withTempDir` -> `withTempPath`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228679
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
+  val readback = Files.readAllBytes(csvFile.toPath)
+  val expected = (content + 
"\n").getBytes(Charset.forName(encoding))
+  assert(readback === expected)
+})
+  }
+}
+  }
+
+  test("SPARK-19018: error handling for unsupported charsets") {
+val exception = intercept[SparkException] {
+  withTempDir { dir =>
--- End diff --

`withTempPath`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228640
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
+  val readback = Files.readAllBytes(csvFile.toPath)
+  val expected = (content + 
"\n").getBytes(Charset.forName(encoding))
--- End diff --

Currently, the newline depends on Univocity. This test is going to break on 
Windows. Let's use the platform's newline.
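
For reference, a sketch of how the expected bytes could be built from the 
platform's line separator instead of a hard-coded `"\n"` (the `expectedBytes` 
helper is hypothetical):

```scala
import java.nio.charset.Charset

// Hypothetical helper for the assertion: use the platform's line separator
// ("\r\n" on Windows, "\n" elsewhere) rather than a hard-coded "\n".
def expectedBytes(content: String, encoding: String): Array[Byte] =
  (content + System.lineSeparator()).getBytes(Charset.forName(encoding))
```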


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21635: [SPARK-24594][YARN] Introducing metrics for YARN

2018-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21635#discussion_r203228423
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterSource.scala
 ---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.metrics.source.Source
+
+private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) 
extends Source {
+
+  override val sourceName: String = "applicationMaster"
--- End diff --

@tgravescs Would you please explain more? Are you going to add a new 
configuration, "spark.metrics.namespace"? Also, how would this configuration be used?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228403
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -512,6 +513,43 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("SPARK-19018: Save csv with custom charset") {
+
+// scalastyle:off nonascii
+val content = "µß áâä ÁÂÄ"
+// scalastyle:on nonascii
+
+Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach 
{ encoding =>
+  withTempDir { dir =>
+val csvDir = new File(dir, "csv")
+
+val originalDF = Seq(content).toDF("_c0").repartition(1)
+originalDF.write
+  .option("encoding", encoding)
+  .csv(csvDir.getCanonicalPath)
+
+csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ 
csvFile =>
--- End diff --

`h({ ` => `h { `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203228243
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -146,7 +148,12 @@ private[csv] class CsvOutputWriter(
 context: TaskAttemptContext,
 params: CSVOptions) extends OutputWriter with Logging {
 
-  private val writer = CodecStreams.createOutputStreamWriter(context, new 
Path(path))
+  private val charset = Charset.forName(params.charset)
+
+  private val writer = CodecStreams.createOutputStreamWriter(
+context,
--- End diff --

tiny nit:

```scala
private val writer = CodecStreams.createOutputStreamWriter(
  context, new Path(path), charset)
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20949: [SPARK-19018][SQL] Add support for custom encodin...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20949#discussion_r203227873
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -625,6 +625,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* enclosed in quotes. Default is to only escape values containing a 
quote character.
* `header` (default `false`): writes the names of columns as the 
first line.
* `nullValue` (default empty string): sets the string 
representation of a null value.
+   * `encoding` (default `UTF-8`): encoding to use when saving to 
file.
--- End diff --

I think we should match the doc with JSON's


https://github.com/apache/spark/blob/6ea582e36ab0a2e4e01340f6fc8cfb8d493d567d/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L525-L526


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21787: [SPARK-24568] Code refactoring for DataType equalsXXX me...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21787
  
But the current code looks less readable and adds more lines. I would 
rather leave this as it was.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21798: [SPARK-24836][SQL] New option for Avro datasource...

2018-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21798#discussion_r203226401
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -276,10 +274,15 @@ private[avro] object AvroFileFormat {
 }
   }
 
-  def ignoreFilesWithoutExtensions(conf: Configuration): Boolean = {
-// Files without .avro extensions are not ignored by default
-val defaultValue = false
+  def ignoreExtension(conf: Configuration, options: Map[String, String]): 
Boolean = {
+val ignoreFilesWithoutExtensionByDefault = false
+val ignoreFilesWithoutExtension = conf.getBoolean(
+  AvroFileFormat.IgnoreFilesWithoutExtensionProperty,
+  ignoreFilesWithoutExtensionByDefault)
 
-conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, 
defaultValue)
+options
+  .get("ignoreExtension")
--- End diff --

Let's make sure we describe that in a public API later.
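
For the public-facing docs, a hypothetical usage sketch of the new option 
from the read path (assuming the external Avro source is registered under 
the `avro` format name):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-ignore-extension").getOrCreate()
val df = spark.read
  .format("avro")
  .option("ignoreExtension", "true")   // also pick up files without a .avro extension
  .load("/path/to/data")
```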


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1082/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21800
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/1081/



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21800: [SPARK-24825][K8S][TEST] Kubernetes integration tests bu...

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21800
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1081/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


