[GitHub] spark issue #17471: [SPARK-3577] Report Spill size on disk for UnsafeExterna...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17471
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78851/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18388#discussion_r124716883
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocksFailed.java ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle.protocol;
+
+import com.google.common.base.Objects;
+import io.netty.buffer.ByteBuf;
+
+// Needed by ScalaDoc. See SPARK-7726
+import static org.apache.spark.network.shuffle.protocol.BlockTransferMessage.Type;
+
+/**
+ * This message is responded from shuffle service when client failed to "open blocks" due to
+ * some reason (e.g. the shuffle service is suffering from high memory cost).
+ */
+public class OpenBlocksFailed extends BlockTransferMessage {
+
+  public final int reason;
+
+  public OpenBlocksFailed(int reason) {
+    this.reason = reason;
+  }
+
+  @Override
+  protected Type type() { return Type.OPEN_BLOCKS_FAILED; }
+
+  @Override
+  public int hashCode() {
+    return Objects.hashCode(reason);
+  }
+
+  public String toString() {
+    String reasonStr = null;
+    switch (reason) {
+      case 1:
+        reasonStr = "shuffle service is suffering high memory cost";
+        break;
+      default:
+        reasonStr = "unknown";
+        break;
+    }
+    return Objects.toStringHelper(this)
+      .add("reason", reasonStr)
+      .toString();
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (other != null && other instanceof OpenBlocksFailed) {
+      OpenBlocksFailed o = (OpenBlocksFailed) other;
+      return Objects.equal(reason, o.reason);
--- End diff --

nit: `this.reason == o.reason`?
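For reference, a minimal sketch of the suggested simplification, assuming the `OpenBlocksFailed` class quoted above (note that `instanceof` already evaluates to false for `null`, so the explicit null check is redundant):

```java
@Override
public boolean equals(Object other) {
  if (other instanceof OpenBlocksFailed) {
    OpenBlocksFailed o = (OpenBlocksFailed) other;
    // Plain primitive comparison; Objects.equal would box both ints first.
    return this.reason == o.reason;
  }
  return false;
}
```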


[GitHub] spark issue #17471: [SPARK-3577] Report Spill size on disk for UnsafeExterna...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17471
  
Merged build finished. Test PASSed.


[GitHub] spark issue #17471: [SPARK-3577] Report Spill size on disk for UnsafeExterna...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17471
  
**[Test build #78851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78851/testReport)** for PR 17471 at commit [`6b94c2b`](https://github.com/apache/spark/commit/6b94c2b05adb26715087af778557934648a58b01).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18458: [SPARK-20889][SparkR] Grouped documentation for COLLECTI...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18458
  
**[Test build #78869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78869/testReport)** for PR 18458 at commit [`c47b3a2`](https://github.com/apache/spark/commit/c47b3a249b51ab093181eaa82d965d6787176778).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18463
  
retest this please


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78868/testReport)** for PR 18463 at commit [`466325d`](https://github.com/apache/spark/commit/466325d3fd353668583f3bde38ae490d9db0b189).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18463
  
retest this please


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78867/testReport)** for PR 18463 at commit [`466325d`](https://github.com/apache/spark/commit/466325d3fd353668583f3bde38ae490d9db0b189).


[GitHub] spark pull request #18458: [SPARK-20889][SparkR] Grouped documentation for C...

2017-06-28 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18458#discussion_r124716253
  
--- Diff: R/pkg/R/functions.R ---
@@ -2156,28 +2178,23 @@ setMethod("date_format", signature(y = "Column", x = "character"),
             column(jc)
           })
 
-#' from_json
-#'
-#' Parses a column containing a JSON string into a Column of \code{structType} with the specified
-#' \code{schema} or array of \code{structType} if \code{as.json.array} is set to \code{TRUE}.
-#' If the string is unparseable, the Column will contains the value NA.
+#' @details
+#' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType}
+#' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set
+#' to \code{TRUE}. If the string is unparseable, the Column will contains the value NA.
 #'
-#' @param x Column containing the JSON string.
+#' @rdname column_collection_functions
 #' @param schema a structType object to use as the schema to use when parsing the JSON string.
 #' @param as.json.array indicating if input string is JSON array of objects or a single object.
-#' @param ... additional named properties to control how the json is parsed, accepts the same
-#'            options as the JSON data source.
-#'
-#' @family non-aggregate functions
-#' @rdname from_json
-#' @name from_json
-#' @aliases from_json,Column,structType-method
+#' @aliases from_json from_json,Column,structType-method
 #' @export
 #' @examples
+#'
 #' \dontrun{
-#' schema <- structType(structField("name", "string"),
-#' select(df, from_json(df$value, schema, dateFormat = "dd/MM/"))
-#'}
+#' df2 <- sql("SELECT named_struct('name', 'Bob') as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#' schema <- structType(structField("name", "string"))
+#' head(select(df2, from_json(df2$people_json, schema)))}
--- End diff --

Thanks for catching this. Added an example. 


[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18388#discussion_r124715952
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java ---
@@ -90,16 +96,28 @@ protected void handleMessage(
       try {
         OpenBlocks msg = (OpenBlocks) msgObj;
         checkAuth(client, msg.appId);
-        long streamId = streamManager.registerStream(client.getClientId(),
-          new ManagedBufferIterator(msg.appId, msg.execId, msg.blockIds));
-        if (logger.isTraceEnabled()) {
-          logger.trace("Registered streamId {} with {} buffers for client {} from host {}",
-            streamId,
-            msg.blockIds.length,
-            client.getClientId(),
-            getRemoteAddress(client.getChannel()));
+        // Return OpenBlocksFailed when memory usage is above the water mark.
+        long usage = memoryUsage.getMemoryUsage();
+        if (usage > memWaterMark) {
+          logger.warn("Memory usage({}) is above water mark({}), rejecting 'open blocks' request " +
+            "from client({}, {}).", usage, memWaterMark, client.getClientId(),
+            client.getSocketAddress());
+          callback.onSuccess(new OpenBlocksFailed(1).toByteBuffer());
+        } else {
+          logger.trace("Memory usage({}) is under water mark({}), accepting 'open blocks' " +
+            "request from client({}, {}).", usage, memWaterMark, client.getClientId(),
+            client.getSocketAddress());
+          long streamId = streamManager.registerStream(client.getClientId(),
+            new ManagedBufferIterator(msg.appId, msg.execId, msg.blockIds));
+          if (logger.isTraceEnabled()) {
+            logger.trace("Registered streamId {} with {} buffers for client {} from host {}",
--- End diff --

shall we merge this and the above log into one log entry?
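A hedged sketch of what a single merged entry could look like, reusing the variables from the diff above (`usage`, `memWaterMark`, `streamId`, `msg`, `client`):

```java
if (logger.isTraceEnabled()) {
  // One entry carries both the memory-check outcome and the stream registration.
  logger.trace("Memory usage({}) is under water mark({}); registered streamId {} with {} " +
      "buffers for client {} from host {}",
    usage, memWaterMark, streamId, msg.blockIds.length,
    client.getClientId(), getRemoteAddress(client.getChannel()));
}
```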


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18463
  
retest this please


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78866/testReport)** for PR 18463 at commit [`466325d`](https://github.com/apache/spark/commit/466325d3fd353668583f3bde38ae490d9db0b189).


[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18388#discussion_r124715302
  
--- Diff: common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java ---
@@ -257,4 +257,31 @@ public Properties cryptoConf() {
     return CryptoUtils.toCryptoConf("spark.network.crypto.config.", conf.getAll());
   }
 
+  /**
+   * When memory usage of Netty is above this water mark, it's regarded as memory shortage.
--- End diff --

do we have a config for shuffle service JVM heap size? maybe we can use 
that.
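One option, if no explicit heap config exists, is to derive the default water mark from the JVM heap itself. A sketch under that assumption — the key name `spark.shuffle.service.memoryWaterMark` is hypothetical, not an existing Spark config, and `conf` is assumed to be TransportConf's config provider:

```java
public long memoryWaterMark() {
  // Runtime.maxMemory() reflects -Xmx, i.e. the heap the shuffle service was started with.
  long maxHeap = Runtime.getRuntime().maxMemory();
  // Treat usage above 90% of the heap as memory shortage unless explicitly overridden.
  return conf.getLong("spark.shuffle.service.memoryWaterMark", (long) (maxHeap * 0.9));
}
```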


[GitHub] spark issue #18450: [SPARK-21238][SQL] allow nested SQL execution

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18450
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18450: [SPARK-21238][SQL] allow nested SQL execution

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18450
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78853/
Test PASSed.


[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18388#discussion_r124715080
  
--- Diff: common/network-common/src/main/java/org/apache/spark/network/util/PooledByteBufAllocatorWithMetrics.java ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.util;
+
+import java.util.Iterator;
+import java.util.List;
+
+import io.netty.buffer.PoolArenaMetric;
+import io.netty.buffer.PoolChunkListMetric;
+import io.netty.buffer.PoolChunkMetric;
+import io.netty.buffer.PooledByteBufAllocator;
+
+/**
+ * A {@link PooledByteBufAllocator} providing some metrics.
+ */
+public class PooledByteBufAllocatorWithMetrics extends PooledByteBufAllocator {
+
+  public PooledByteBufAllocatorWithMetrics(
+      boolean preferDirect,
+      int nHeapArena,
+      int nDirectArena,
+      int pageSize,
+      int maxOrder,
+      int tinyCacheSize,
+      int smallCacheSize,
+      int normalCacheSize) {
+    super(preferDirect, nHeapArena, nDirectArena, pageSize, maxOrder, tinyCacheSize,
+      smallCacheSize, normalCacheSize);
+  }
+
+  public long offHeapUsage() {
+    return sumOfMetrics(directArenas());
+  }
+
+  public long onHeapUsage() {
+    return sumOfMetrics(heapArenas());
+  }
+
+  private long sumOfMetrics(List<PoolArenaMetric> metrics) {
+    long sum = 0;
+    for (int i = 0; i < metrics.size(); i++) {
+      PoolArenaMetric metric = metrics.get(i);
--- End diff --

nit: it's better to use `Iterator` pattern here, as the input list may not 
be an indexed list and `list.get(i)` becomes `O(n)`.
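A sketch of the suggested rewrite: the enhanced for-loop iterates via the list's `Iterator`, so the whole sum stays O(n) even for non-indexed lists such as `LinkedList`. The `numActiveBytes()` accessor is used for illustration only, since the quoted diff cuts off before the loop body:

```java
private long sumOfMetrics(List<PoolArenaMetric> metrics) {
  long sum = 0;
  for (PoolArenaMetric metric : metrics) {
    // Accumulate the per-arena usage reported by each metric.
    sum += metric.numActiveBytes();
  }
  return sum;
}
```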


[GitHub] spark pull request #18458: [SPARK-20889][SparkR] Grouped documentation for C...

2017-06-28 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18458#discussion_r124715019
  
--- Diff: R/pkg/R/functions.R ---
@@ -2156,28 +2178,23 @@ setMethod("date_format", signature(y = "Column", x = "character"),
             column(jc)
           })
 
-#' from_json
-#'
-#' Parses a column containing a JSON string into a Column of \code{structType} with the specified
-#' \code{schema} or array of \code{structType} if \code{as.json.array} is set to \code{TRUE}.
-#' If the string is unparseable, the Column will contains the value NA.
+#' @details
+#' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType}
+#' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set
+#' to \code{TRUE}. If the string is unparseable, the Column will contains the value NA.
--- End diff --

Corrected the typo. Will consider updating `null` & `NA` in the future :)


[GitHub] spark issue #18450: [SPARK-21238][SQL] allow nested SQL execution

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18450
  
**[Test build #78853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78853/testReport)** for PR 18450 at commit [`f8e9901`](https://github.com/apache/spark/commit/f8e99013dffeffc2bfe37624b84dbf9736fed8b9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78863/
Test FAILed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78863/testReport)** for PR 18463 at commit [`488c287`](https://github.com/apache/spark/commit/488c2871e4589f1a469cff2dba1e962173eaf910).
 * This patch **fails due to an unknown error code, -10**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Merged build finished. Test FAILed.


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124714544
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -267,10 +298,111 @@ class SQLMetricsSuite extends SparkFunSuite with SharedSQLContext {
     val df = df1.join(broadcast(df2), "key")
     testSparkPlanMetrics(df, 2, Map(
       1L -> ("BroadcastHashJoin", Map(
-        "number of output rows" -> 2L)))
+        "number of output rows" -> 2L,
+        "avg hash probe (min, med, max)" -> "\n(1, 1, 1)")))
     )
   }
 
+  test("BroadcastHashJoin metrics: track avg probe") {
+    // The executed plan looks like:
+    // Project [a#210, b#211, b#221]
+    // +- BroadcastHashJoin [a#210], [a#220], Inner, BuildRight
+    //    :- Project [_1#207 AS a#210, _2#208 AS b#211]
+    //    :  +- Filter isnotnull(_1#207)
+    //    :     +- LocalTableScan [_1#207, _2#208]
+    //    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, binary, true]))
+    //       +- Project [_1#217 AS a#220, _2#218 AS b#221]
+    //          +- Filter isnotnull(_1#217)
+    //             +- LocalTableScan [_1#217, _2#218]
+    //
+    // Assume the execution plan is
+    // WholeStageCodegen disabled:
+    // ... -> BroadcastHashJoin(nodeId = 1) -> Project(nodeId = 0)
+    //
+    // WholeStageCodegen enabled:
+    // ... ->
+    // WholeStageCodegen(nodeId = 0, Filter(nodeId = 4) -> Project(nodeId = 3) ->

can you format it a little bit? to indicate that we only have a 
`WholeStageCodegen`, all other plans are the inner children of 
`WholeStageCodegen`.


[GitHub] spark issue #18448: [SPARK-20889][SparkR] Grouped documentation for MISC col...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18448
  
**[Test build #78865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78865/testReport)** for PR 18448 at commit [`ff27f18`](https://github.com/apache/spark/commit/ff27f182b9055511d2fef59c6d66e113fcbef535).


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18301
  
@viirya  ok let's add it back


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124714354
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -367,6 +367,22 @@ class TungstenAggregationIterator(
     }
   }
 
+  TaskContext.get().addTaskCompletionListener(_ => {
+    // At the end of the task, update the task's peak memory usage. Since we destroy
+    // the map to create the sorter, their memory usages should not overlap, so it is safe
+    // to just use the max of the two.
+    val mapMemory = hashMap.getPeakMemoryUsedBytes
+    val sorterMemory = Option(externalSorter).map(_.getPeakMemoryUsedBytes).getOrElse(0L)
+    val maxMemory = Math.max(mapMemory, sorterMemory)
+    val metrics = TaskContext.get().taskMetrics()
+    peakMemory += maxMemory
+    spillSize += metrics.memoryBytesSpilled - spillSizeBefore
+    metrics.incPeakExecutionMemory(maxMemory)
--- End diff --

makes sense


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18301
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78859/
Test FAILed.


[GitHub] spark pull request #18448: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-28 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18448#discussion_r124714226
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,6 +132,27 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
+#' Miscellaneous functions for Column operations
+#'
+#' Miscellaneous functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. In \code{sha2}, it is one of 224, 256, 384, or 512.
+#' @param y Column to compute on.
+#' @param ... additional columns.
--- End diff --

updated now. 


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18301
  
Merged build finished. Test FAILed.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18301
  
**[Test build #78859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78859/testReport)** for PR 18301 at commit [`9cbd627`](https://github.com/apache/spark/commit/9cbd627bed6279550a85aaf1d596f22c6b69bfc6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #18448: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-28 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18448#discussion_r124714065
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,6 +132,27 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
+#' Miscellaneous functions for Column operations
+#'
+#' Miscellaneous functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. In \code{sha2}, it is one of 224, 256, 
384, or 512.
+#' @param y Column to compute on.
--- End diff --

I think roxygen automatically chooses the order of the arguments based on 
the order they appear in the file, and ignores the order we specify. So even if 
I move `y` before `x` here, in the generated doc, `x` will still appear before 
`y`. Indeed, as you can see from the screenshot, `...` appears before `y`.  


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Merged build finished. Test FAILed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78861/testReport)** for PR 18463 at commit [`86bfa22`](https://github.com/apache/spark/commit/86bfa22d1f8d46e75dcc5f9085b7976365bc0e8f).
 * This patch **fails due to an unknown error code, -10**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78861/
Test FAILed.


[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18405
  
**[Test build #78864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78864/testReport)** for PR 18405 at commit [`0163e04`](https://github.com/apache/spark/commit/0163e04e9a5705fe963bad764704e6828161b374).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78860/
Test PASSed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78860/testReport)** for PR 18463 at commit [`5d5b390`](https://github.com/apache/spark/commit/5d5b39077d49225df2603217dea7e8d978a22a76).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #18416: [SPARK-21204][SQL][WIP] Add support for Scala Set...

2017-06-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18416#discussion_r124712549
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ---
@@ -339,6 +340,28 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext {
       LHMapClass(LHMap(1 -> 2)) -> LHMap("test" -> MapClass(Map(3 -> 4
   }
 
+  test("arbitrary sets") {
 
+  test("arbitrary sets") {
--- End diff --

Added a test for it.


[GitHub] spark pull request #18416: [SPARK-21204][SQL][WIP] Add support for Scala Set...

2017-06-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18416#discussion_r124712535
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ---
@@ -834,6 +834,140 @@ case class CollectObjectsToMap private(
   }
 }
 
+object CollectObjectsToSet {
+  private val curId = new java.util.concurrent.atomic.AtomicInteger()
+
+  /**
+   * Construct an instance of CollectObjectsToSet case class.
+   *
+   * @param function The function applied on the collection elements.
+   * @param inputData An expression that when evaluated returns a collection object.
+   * @param collClass The type of the resulting collection.
+   */
+  def apply(
+      function: Expression => Expression,
+      inputData: Expression,
+      collClass: Class[_]): CollectObjectsToSet = {
+    val id = curId.getAndIncrement()
+    val loopValue = s"CollectObjectsToSet_loopValue$id"
+    val loopIsNull = s"CollectObjectsToSet_loopIsNull$id"
+    val arrayType = inputData.dataType.asInstanceOf[ArrayType]
+    val loopVar = LambdaVariable(loopValue, loopIsNull, arrayType.elementType)
+    CollectObjectsToSet(
+      loopValue, loopIsNull, function(loopVar), inputData, collClass)
+  }
+}
+
+/**
+ * Expression used to convert a Catalyst Array to an external Scala `Set`.
+ * The collection is constructed using the associated builder, obtained by calling `newBuilder`
+ * on the collection's companion object.
+ *
+ * Notice that when we convert a Catalyst array which contains duplicated elements to an external
+ * Scala `Set`, the elements will be de-duplicated.
+ *
+ * @param loopValue the name of the loop variable that is used when iterating over the value
+ *                  collection, and which is used as input for the `lambdaFunction`
+ * @param loopIsNull the nullability of the loop variable that is used when iterating over
+ *                   the value collection, and which is used as input for the
+ *                   `lambdaFunction`
+ * @param lambdaFunction A function that takes the `loopValue` as input, and is used as
+ *                       a lambda function to handle collection elements.
+ * @param inputData An expression that when evaluated returns an array object.
+ * @param collClass The type of the resulting collection.
+ */
+case class CollectObjectsToSet private(
+    loopValue: String,
+    loopIsNull: String,
+    lambdaFunction: Expression,
+    inputData: Expression,
+    collClass: Class[_]) extends Expression with NonSQLExpression {
+
+  override def nullable: Boolean = inputData.nullable
+
+  override def children: Seq[Expression] = lambdaFunction :: inputData :: Nil
+
+  override def eval(input: InternalRow): Any =
+    throw new UnsupportedOperationException("Only code-generated evaluation is supported")
+
+  override def dataType: DataType = ObjectType(collClass)
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    // The data with PythonUserDefinedType are actually stored with the data type of its sqlType.
+    def inputDataType(dataType: DataType) = dataType match {
+      case p: PythonUserDefinedType => p.sqlType
+      case _ => dataType
+    }
+
+    val arrayType = inputDataType(inputData.dataType).asInstanceOf[ArrayType]
+    val loopValueJavaType = ctx.javaType(arrayType.elementType)
+    ctx.addMutableState("boolean", loopIsNull, "")
+    ctx.addMutableState(loopValueJavaType, loopValue, "")
+    val genFunction = lambdaFunction.genCode(ctx)
+
+    val genInputData = inputData.genCode(ctx)
+    val dataLength = ctx.freshName("dataLength")
+    val loopIndex = ctx.freshName("loopIndex")
+    val builderValue = ctx.freshName("builderValue")
+
+    val getLength = s"${genInputData.value}.numElements()"
+    val getLoopVar = ctx.getValue(genInputData.value, arrayType.elementType, loopIndex)
+
+    // Make a copy of the data if it's unsafe-backed
+    def makeCopyIfInstanceOf(clazz: Class[_ <: Any], value: String) =
+      s"$value instanceof ${clazz.getSimpleName}? $value.copy() : $value"
+    val genFunctionValue =
+      lambdaFunction.dataType match {
+        case StructType(_) => makeCopyIfInstanceOf(classOf[UnsafeRow], genFunction.value)
+        case ArrayType(_, _) => makeCopyIfInstanceOf(classOf[UnsafeArrayData], genFunction.value)
+        case MapType(_, _, _) => makeCopyIfInstanceOf(classOf[UnsafeMapData], genFunction.value)
+        case _ => genFunction.value
+      }
+
+    val loopNullCheck = s"$loopIsNull = ${genInputData.value}.isNullAt($loopIndex);"

[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18301
  
@rxin I just reverted it in previous commits. @cloud-fan should I revert it back?


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78863/testReport)** for PR 18463 at commit [`488c287`](https://github.com/apache/spark/commit/488c2871e4589f1a469cff2dba1e962173eaf910).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18301
  
hey i didn't track super closely, but it is pretty important to show at 
least one more digit, e.g. 1.7, rather than just 2.
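In other words, the reported average should keep a fractional digit instead of rounding to a whole number. A minimal sketch of that formatting (the variable names are illustrative, not Spark's actual metric code):

```java
double avgProbes = (double) totalProbes / numLookups;
// Locale.ROOT keeps the decimal separator stable across JVM locales.
String shown = String.format(java.util.Locale.ROOT, "%.1f", avgProbes);  // "1.7" rather than "2"
```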



[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18301
  
**[Test build #78862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78862/testReport)** for PR 18301 at commit [`9a048f8`](https://github.com/apache/spark/commit/9a048f817b9a6499a64778c13141c9bc320cf2ab).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78861/testReport)** for PR 18463 at commit [`86bfa22`](https://github.com/apache/spark/commit/86bfa22d1f8d46e75dcc5f9085b7976365bc0e8f).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78860/testReport)** for PR 18463 at commit [`5d5b390`](https://github.com/apache/spark/commit/5d5b39077d49225df2603217dea7e8d978a22a76).


[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16028
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16028
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78857/
Test PASSed.


[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16028
  
**[Test build #78857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78857/testReport)** for PR 16028 at commit [`d84bb21`](https://github.com/apache/spark/commit/d84bb214908aea84421133958762bbf2a3e4f7d9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18405
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78852/
Test FAILed.


[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18405
  
Merged build finished. Test FAILed.


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124710724
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -163,29 +178,45 @@ class SQLMetricsSuite extends SparkFunSuite with SharedSQLContext {
 val df2 = testData2.groupBy('a).count()
 val expected2 = Seq(
   Map("number of output rows" -> 4L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"),
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"),
   Map("number of output rows" -> 3L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"))
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"))
 testSparkPlanMetrics(df2, 1, Map(
   2L -> ("HashAggregate", expected2(0)),
   0L -> ("HashAggregate", expected2(1)))
 )
   }
 
   test("Aggregate metrics: track avg probe") {
-val random = new Random()
-val manyBytes = (0 until 65535).map { _ =>
-  val byteArrSize = random.nextInt(100)
-  val bytes = new Array[Byte](byteArrSize)
-  random.nextBytes(bytes)
-  (bytes, random.nextInt(100))
-}
-val df = manyBytes.toSeq.toDF("a", "b").repartition(1).groupBy('a).count()
-val metrics = getSparkPlanMetrics(df, 1, Set(2L, 0L)).get
-Seq(metrics(2L)._2("avg hashmap probe (min, med, max)"),
-metrics(0L)._2("avg hashmap probe (min, med, max)")).foreach { probes =>
-  probes.toString.stripPrefix("\n(").stripSuffix(")").split(", ").foreach { probe =>
-assert(probe.toInt > 1)
+// The executed plan looks like:
+// HashAggregate(keys=[a#61], functions=[count(1)], output=[a#61, count#71L])
+// +- Exchange hashpartitioning(a#61, 5)
+//+- HashAggregate(keys=[a#61], functions=[partial_count(1)], output=[a#61, count#76L])
+//   +- Exchange RoundRobinPartitioning(1)
+//  +- LocalTableScan [a#61]
+//
+// Assume the execution plan is:
+// Wholestage disabled:
+// LocalTableScan(nodeId = 4) -> Exchange(nodeId = 3) -> HashAggregate(nodeId = 2) ->
--- End diff --

I attached the tree string. This doc comment is used to show the `nodeId` relations.


[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18405
  
**[Test build #78852 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78852/testReport)**
 for PR 18405 at commit 
[`c998374`](https://github.com/apache/spark/commit/c998374cf68e4f8520b9b29fd40c3a4b652dbdb8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124710649
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -367,6 +367,22 @@ class TungstenAggregationIterator(
 }
   }
 
+  TaskContext.get().addTaskCompletionListener(_ => {
+// At the end of the task, update the task's peak memory usage. Since we destroy
+// the map to create the sorter, their memory usages should not overlap, so it is safe
+// to just use the max of the two.
+val mapMemory = hashMap.getPeakMemoryUsedBytes
+val sorterMemory = Option(externalSorter).map(_.getPeakMemoryUsedBytes).getOrElse(0L)
+val maxMemory = Math.max(mapMemory, sorterMemory)
+val metrics = TaskContext.get().taskMetrics()
+peakMemory += maxMemory
+spillSize += metrics.memoryBytesSpilled - spillSizeBefore
+metrics.incPeakExecutionMemory(maxMemory)
--- End diff --

hmm, the description of `peakExecutionMemory` in `TaskMetrics` is:

...The value of this accumulator should be approximately the sum of the 
peak sizes across all such data structures created in this task...

So it is designed to be the sum of the peak memory of the operators in the 
task. Because the operators do not run one after another but in a pipelined, 
iterator fashion, I think it is reasonable to sum their peaks, even though 
the peak points of the operators might not occur at the same moment.
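
For illustration, a minimal, self-contained Scala sketch of the semantics 
described above. The names (`TaskMetricsSim`, `reportOperatorPeak`) and the 
numbers are stand-ins for this sketch, not Spark's actual `TaskMetrics` API:

```scala
// Each operator reports the peak of its own data structures once, at task
// completion; the task-level metric accumulates the SUM of those peaks.
object PeakMemorySketch {
  final class TaskMetricsSim {
    private var peakExecutionMemory = 0L
    def incPeakExecutionMemory(v: Long): Unit = peakExecutionMemory += v
    def peak: Long = peakExecutionMemory
  }

  // The hash map is destroyed before the sorter is created, so their usages
  // never overlap: the max of the two is this operator's peak.
  def reportOperatorPeak(metrics: TaskMetricsSim, mapPeak: Long,
                         sorterPeak: Option[Long]): Unit =
    metrics.incPeakExecutionMemory(math.max(mapPeak, sorterPeak.getOrElse(0L)))

  def main(args: Array[String]): Unit = {
    val metrics = new TaskMetricsSim
    reportOperatorPeak(metrics, mapPeak = 512L, sorterPeak = Some(768L)) // aggregate
    reportOperatorPeak(metrics, mapPeak = 256L, sorterPeak = None)       // join
    // Sum of the per-operator peaks (768 + 256), even though the two peaks
    // need not occur at the same moment.
    println(metrics.peak) // prints 1024
  }
}
```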
 


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18463
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78858/
Test PASSed.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78858 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78858/testReport)**
 for PR 18463 at commit 
[`5d5b390`](https://github.com/apache/spark/commit/5d5b39077d49225df2603217dea7e8d978a22a76).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18463
  
retest this please


[GitHub] spark issue #18435: [SPARK-21225][CORE] Considering CPUS_PER_TASK when alloc...

2017-06-28 Thread JackYangzg
Github user JackYangzg commented on the issue:

https://github.com/apache/spark/pull/18435
  
@jerryshao Ok


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124709997
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -163,29 +178,45 @@ class SQLMetricsSuite extends SparkFunSuite with SharedSQLContext {
 val df2 = testData2.groupBy('a).count()
 val expected2 = Seq(
   Map("number of output rows" -> 4L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"),
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"),
   Map("number of output rows" -> 3L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"))
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"))
 testSparkPlanMetrics(df2, 1, Map(
   2L -> ("HashAggregate", expected2(0)),
   0L -> ("HashAggregate", expected2(1)))
 )
   }
 
   test("Aggregate metrics: track avg probe") {
-val random = new Random()
-val manyBytes = (0 until 65535).map { _ =>
-  val byteArrSize = random.nextInt(100)
-  val bytes = new Array[Byte](byteArrSize)
-  random.nextBytes(bytes)
-  (bytes, random.nextInt(100))
-}
-val df = manyBytes.toSeq.toDF("a", "b").repartition(1).groupBy('a).count()
-val metrics = getSparkPlanMetrics(df, 1, Set(2L, 0L)).get
-Seq(metrics(2L)._2("avg hashmap probe (min, med, max)"),
-metrics(0L)._2("avg hashmap probe (min, med, max)")).foreach { probes =>
-  probes.toString.stripPrefix("\n(").stripSuffix(")").split(", ").foreach { probe =>
-assert(probe.toInt > 1)
+// The executed plan looks like:
+// HashAggregate(keys=[a#61], functions=[count(1)], output=[a#61, count#71L])
+// +- Exchange hashpartitioning(a#61, 5)
+//+- HashAggregate(keys=[a#61], functions=[partial_count(1)], output=[a#61, count#76L])
+//   +- Exchange RoundRobinPartitioning(1)
+//  +- LocalTableScan [a#61]
+//
+// Assume the execution plan is:
+// Wholestage disabled:
+// LocalTableScan(nodeId = 4) -> Exchange(nodeId = 3) -> HashAggregate(nodeId = 2) ->
--- End diff --

tree string here please


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124710004
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -163,29 +178,45 @@ class SQLMetricsSuite extends SparkFunSuite with SharedSQLContext {
 val df2 = testData2.groupBy('a).count()
 val expected2 = Seq(
   Map("number of output rows" -> 4L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"),
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"),
   Map("number of output rows" -> 3L,
-"avg hashmap probe (min, med, max)" -> "\n(1, 1, 1)"))
+"avg hash probe (min, med, max)" -> "\n(1, 1, 1)"))
 testSparkPlanMetrics(df2, 1, Map(
   2L -> ("HashAggregate", expected2(0)),
   0L -> ("HashAggregate", expected2(1)))
 )
   }
 
   test("Aggregate metrics: track avg probe") {
-val random = new Random()
-val manyBytes = (0 until 65535).map { _ =>
-  val byteArrSize = random.nextInt(100)
-  val bytes = new Array[Byte](byteArrSize)
-  random.nextBytes(bytes)
-  (bytes, random.nextInt(100))
-}
-val df = manyBytes.toSeq.toDF("a", "b").repartition(1).groupBy('a).count()
-val metrics = getSparkPlanMetrics(df, 1, Set(2L, 0L)).get
-Seq(metrics(2L)._2("avg hashmap probe (min, med, max)"),
-metrics(0L)._2("avg hashmap probe (min, med, max)")).foreach { probes =>
-  probes.toString.stripPrefix("\n(").stripSuffix(")").split(", ").foreach { probe =>
-assert(probe.toInt > 1)
+// The executed plan looks like:
+// HashAggregate(keys=[a#61], functions=[count(1)], output=[a#61, count#71L])
+// +- Exchange hashpartitioning(a#61, 5)
+//+- HashAggregate(keys=[a#61], functions=[partial_count(1)], output=[a#61, count#76L])
+//   +- Exchange RoundRobinPartitioning(1)
+//  +- LocalTableScan [a#61]
+//
+// Assume the execution plan is:
+// Wholestage disabled:
+// LocalTableScan(nodeId = 4) -> Exchange(nodeId = 3) -> HashAggregate(nodeId = 2) ->
+// Exchange(nodeId = 1) -> HashAggregate(nodeId = 0)
+//
+// Wholestage enabled:
+// LocalTableScan(nodeId = 6) -> Exchange(nodeId = 5) ->
--- End diff --

ditto


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124709371
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -367,6 +367,22 @@ class TungstenAggregationIterator(
 }
   }
 
+  TaskContext.get().addTaskCompletionListener(_ => {
+// At the end of the task, update the task's peak memory usage. Since we destroy
+// the map to create the sorter, their memory usages should not overlap, so it is safe
+// to just use the max of the two.
+val mapMemory = hashMap.getPeakMemoryUsedBytes
+val sorterMemory = Option(externalSorter).map(_.getPeakMemoryUsedBytes).getOrElse(0L)
+val maxMemory = Math.max(mapMemory, sorterMemory)
+val metrics = TaskContext.get().taskMetrics()
+peakMemory += maxMemory
+spillSize += metrics.memoryBytesSpilled - spillSizeBefore
+metrics.incPeakExecutionMemory(maxMemory)
--- End diff --

not related to this PR, but shouldn't `peakMemory` pick the max memory usage 
among the operators in one task, instead of accumulating them?
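
A toy sketch of the two aggregation choices in question, with hypothetical 
per-operator peak sizes:

```scala
// Two operators in one task, with made-up peak usages of 700 and 300 bytes.
object MaxVsSum extends App {
  val operatorPeaks = Seq(700L, 300L)
  println(operatorPeaks.sum) // 1000: accumulate, as the code above does
  println(operatorPeaks.max) //  700: task-wide max, as this comment suggests
}
```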


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124709196
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -367,6 +367,22 @@ class TungstenAggregationIterator(
 }
   }
 
+  TaskContext.get().addTaskCompletionListener(_ => {
+// At the end of the task, update the task's peak memory usage. Since we destroy
+// the map to create the sorter, their memory usages should not overlap, so it is safe
+// to just use the max of the two.
+val mapMemory = hashMap.getPeakMemoryUsedBytes
+val sorterMemory = Option(externalSorter).map(_.getPeakMemoryUsedBytes).getOrElse(0L)
+val maxMemory = Math.max(mapMemory, sorterMemory)
+val metrics = TaskContext.get().taskMetrics()
+peakMemory += maxMemory
+spillSize += metrics.memoryBytesSpilled - spillSizeBefore
--- End diff --

ditto


[GitHub] spark pull request #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18301#discussion_r124709173
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
@@ -367,6 +367,22 @@ class TungstenAggregationIterator(
 }
   }
 
+  TaskContext.get().addTaskCompletionListener(_ => {
+// At the end of the task, update the task's peak memory usage. Since we destroy
+// the map to create the sorter, their memory usages should not overlap, so it is safe
+// to just use the max of the two.
+val mapMemory = hashMap.getPeakMemoryUsedBytes
+val sorterMemory = Option(externalSorter).map(_.getPeakMemoryUsedBytes).getOrElse(0L)
+val maxMemory = Math.max(mapMemory, sorterMemory)
+val metrics = TaskContext.get().taskMetrics()
+peakMemory += maxMemory
--- End diff --

nit: it's clearer to call `set` here, instead of `+=`
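
A small sketch of the readability point; `MetricSim` below is a stand-in for 
this sketch, not Spark's actual `SQLMetric` class:

```scala
// A metric that is written once with a final value reads better with `set`;
// `+=` suggests repeated accumulation of increments.
final class MetricSim {
  private var value = 0L
  def set(v: Long): Unit = value = v   // "final value, written once"
  def +=(v: Long): Unit = value += v   // "running total"
  def get: Long = value
}

object SetVsAccumulate extends App {
  val peakMemory = new MetricSim
  val maxMemory = 1024L        // computed once, at task completion
  peakMemory.set(maxMemory)    // the suggested, clearer spelling
  println(peakMemory.get)      // prints 1024
}
```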


[GitHub] spark issue #18458: [SPARK-20889][SparkR] Grouped documentation for COLLECTI...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18458
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18458: [SPARK-20889][SparkR] Grouped documentation for COLLECTI...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18458
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78856/
Test PASSed.


[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2017-06-28 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/13873
  
Hmm, I guess we need #16056 to fix nullability of `StaticInvoke`.


[GitHub] spark issue #18458: [SPARK-20889][SparkR] Grouped documentation for COLLECTI...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18458
  
**[Test build #78856 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78856/testReport)**
 for PR 18458 at commit 
[`664629d`](https://github.com/apache/spark/commit/664629dab0150d4db2ea7fcdc63d35f6694bad7f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18448: [SPARK-20889][SparkR] Grouped documentation for MISC col...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18448
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18448: [SPARK-20889][SparkR] Grouped documentation for MISC col...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18448
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78854/
Test PASSed.


[GitHub] spark issue #18422: [SPARK-20889][SparkR] Grouped documentation for NONAGGRE...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18422
  
Merged build finished. Test PASSed.


[GitHub] spark issue #18448: [SPARK-20889][SparkR] Grouped documentation for MISC col...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18448
  
**[Test build #78854 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78854/testReport)**
 for PR 18448 at commit 
[`203be11`](https://github.com/apache/spark/commit/203be118cd7bc8a4a919150cad5ac086f4c55c6f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18422: [SPARK-20889][SparkR] Grouped documentation for NONAGGRE...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18422
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78855/
Test PASSed.


[GitHub] spark issue #18422: [SPARK-20889][SparkR] Grouped documentation for NONAGGRE...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18422
  
**[Test build #78855 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78855/testReport)**
 for PR 18422 at commit 
[`aff832e`](https://github.com/apache/spark/commit/aff832ef95192532f161fa89cb9f49a7cc1d2d08).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18430: [SPARK-21223]:Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi0615
Github user zenglinxi0615 commented on the issue:

https://github.com/apache/spark/pull/18430
  
@jerryshao actually, this threading issue causes an infinite loop when we 
restart the history server and replay the event logs of Spark apps. You can 
see the jstack log in the attachments of SPARK-21223.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18301
  
**[Test build #78859 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78859/testReport)**
 for PR 18301 at commit 
[`9cbd627`](https://github.com/apache/spark/commit/9cbd627bed6279550a85aaf1d596f22c6b69bfc6).


[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18463
  
**[Test build #78858 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78858/testReport)**
 for PR 18463 at commit 
[`5d5b390`](https://github.com/apache/spark/commit/5d5b39077d49225df2603217dea7e8d978a22a76).


[GitHub] spark pull request #18458: [SPARK-20889][SparkR] Grouped documentation for C...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18458#discussion_r124706483
  
--- Diff: R/pkg/R/functions.R ---
@@ -2156,28 +2178,23 @@ setMethod("date_format", signature(y = "Column", x = "character"),
 column(jc)
   })
 
-#' from_json
-#'
-#' Parses a column containing a JSON string into a Column of \code{structType} with the specified
-#' \code{schema} or array of \code{structType} if \code{as.json.array} is set to \code{TRUE}.
-#' If the string is unparseable, the Column will contains the value NA.
+#' @details
+#' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType}
+#' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set
+#' to \code{TRUE}. If the string is unparseable, the Column will contains the value NA.
 #'
-#' @param x Column containing the JSON string.
+#' @rdname column_collection_functions
 #' @param schema a structType object to use as the schema to use when parsing the JSON string.
 #' @param as.json.array indicating if input string is JSON array of objects or a single object.
-#' @param ... additional named properties to control how the json is parsed, accepts the same
-#'options as the JSON data source.
-#'
-#' @family non-aggregate functions
-#' @rdname from_json
-#' @name from_json
-#' @aliases from_json,Column,structType-method
+#' @aliases from_json from_json,Column,structType-method
 #' @export
 #' @examples
+#'
 #' \dontrun{
-#' schema <- structType(structField("name", "string"),
-#' select(df, from_json(df$value, schema, dateFormat = "dd/MM/"))
-#'}
+#' df2 <- sql("SELECT named_struct('name', 'Bob') as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#' schema <- structType(structField("name", "string"))
+#' head(select(df2, from_json(df2$people_json, schema)))}
--- End diff --

I think it's worthwhile to keep `dateFormat = "dd/MM/"` in the example


[GitHub] spark pull request #18458: [SPARK-20889][SparkR] Grouped documentation for C...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18458#discussion_r124706890
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,6 +132,35 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
+#' Collection functions for Column operations
+#'
+#' Collection functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. Note the difference in the following methods:
+#'  \itemize{
+#'  \item \code{to_json}: it is the column containing the struct or array of the structs.
+#'  \item \code{from_json}: it is the column containing the JSON string.
+#'  }
+#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains
+#'additional named properties to control how it is converted, accepts the same
+#'options as the JSON data source.
+#' @name column_collection_functions
+#' @rdname column_collection_functions
+#' @family collection functions
+#' @examples
+#' \dontrun{
+#' # Dataframe used throughout this doc
+#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
+#' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
+#' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1)))
+#' tmp2 <- mutate(tmp, v2 = explode(tmp$v1))
+#' head(tmp2)
+#' head(select(tmp, posexplode(tmp$v1)))
+#' head(select(tmp, sort_array(tmp$v1)))
+#' head(select(tmp, sort_array(tmp$v1, FALSE)))}
--- End diff --

nit: let's improve this? I think the `sort_array` example could be clearer, 
e.g. `sort_array(tmp$v1, asc = FALSE)`


[GitHub] spark pull request #18458: [SPARK-20889][SparkR] Grouped documentation for C...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18458#discussion_r124706681
  
--- Diff: R/pkg/R/functions.R ---
@@ -2156,28 +2178,23 @@ setMethod("date_format", signature(y = "Column", x = "character"),
 column(jc)
   })
 
-#' from_json
-#'
-#' Parses a column containing a JSON string into a Column of \code{structType} with the specified
-#' \code{schema} or array of \code{structType} if \code{as.json.array} is set to \code{TRUE}.
-#' If the string is unparseable, the Column will contains the value NA.
+#' @details
+#' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType}
+#' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set
+#' to \code{TRUE}. If the string is unparseable, the Column will contains the value NA.
--- End diff --

btw, `will contains the value NA.` isn't very consistently documented. In 
this case NA is right, but there are many others that say the value is 
`null` (note the lower case), which isn't quite correct on the R side.

Another project? :)



[GitHub] spark issue #18463: [WIP][SPARK-21093][R] Terminate R's worker processes in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18463
  
Apparently, I can only reproduce this with Jenkins for now.

I tested this on the environments below:

- CentOS Linux release 7.3.1611 (Core) / R version 3.4.0 / Java(TM) SE 
Runtime Environment (build 1.8.0_101-b13)
- macOS 10.12.3 (16D32) / R version 3.4.0 / Java(TM) SE Runtime Environment 
(build 1.8.0_45-b14)
- Ubuntu 14.04 LTS / R version 3.3.1 / Java(TM) SE Runtime Environment 
(build 1.8.0_131-b11) 

At least, I checked that `SparkSQL functions: Spark package found ...` passes, 
which previously failed with an unknown error code, -10 - 
https://github.com/apache/spark/pull/18456

I ran this 10-ish times each on macOS and CentOS and 3 times on Ubuntu, but I 
could not reproduce it.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18301
  
retest this please.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18301
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78848/
Test FAILed.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18301
  
**[Test build #78848 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78848/testReport)**
 for PR 18301 at commit 
[`9cbd627`](https://github.com/apache/spark/commit/9cbd627bed6279550a85aaf1d596f22c6b69bfc6).
 * This patch **fails due to an unknown error code, -10**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18301
  
Merged build finished. Test FAILed.


[GitHub] spark pull request #18463: [WIP][SPARK-21093][R] Terminate R's worker proces...

2017-06-28 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/18463

[WIP][SPARK-21093][R] Terminate R's worker processes in the parent of R's 
daemon to prevent a leak

## What changes were proposed in this pull request?

This is a retry for https://github.com/apache/spark/pull/18320

## How was this patch tested?

Manually tested.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-21093-retry

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18463


commit f2c15a3241583b9d692cf48b36e62d8b84cbc5dd
Author: hyukjinkwon 
Date:   2017-06-25T18:05:57Z

[SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon 
to prevent a leak

## What changes were proposed in this pull request?

`mcfork` in R appears to open a pipe ahead of time, but the existing logic does 
not properly close it when it is executed hot. This leads to further forks 
failing once the limit on the number of open files is reached.

This hot execution shows up particularly with `gapply`/`gapplyCollect`. For an 
unknown reason, it happens more easily on CentOS and could be reproduced on 
Mac too.

All the details are described in 
https://issues.apache.org/jira/browse/SPARK-21093

This PR proposes simply to terminate R's worker processes in the parent of 
R's daemon to prevent a leak.

## How was this patch tested?

I ran the codes below on both CentOS and Mac with that configuration 
disabled/enabled.

```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
...  # 30 times
```

Also, now it passes R tests on CentOS as below:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark

..........................................................................

```

Author: hyukjinkwon 

Closes #18320 from HyukjinKwon/SPARK-21093.

commit 9e907cbaa6d6b65e09008181b61747ffcb67d5d0
Author: hyukjinkwon 
Date:   2017-06-29T03:46:12Z

Disable Scala/Python tests for debugging and print everything




[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16028
  
**[Test build #78857 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78857/testReport)**
 for PR 16028 at commit 
[`d84bb21`](https://github.com/apache/spark/commit/d84bb214908aea84421133958762bbf2a3e4f7d9).


[GitHub] spark pull request #18448: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18448#discussion_r124706110
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,6 +132,27 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
+#' Miscellaneous functions for Column operations
+#'
+#' Miscellaneous functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. In \code{sha2}, it is one of 224, 256, 384, or 512.
+#' @param y Column to compute on.
--- End diff --

probably not to be done in this PR ... but since `y` always goes first, should 
we flip this order and list `y` first?


[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13873
  
Merged build finished. Test FAILed.


[GitHub] spark pull request #18448: [SPARK-20889][SparkR] Grouped documentation for M...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18448#discussion_r124706139
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,6 +132,27 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
+#' Miscellaneous functions for Column operations
+#'
+#' Miscellaneous functions defined for \code{Column}.
+#'
+#' @param x Column to compute on. In \code{sha2}, it is one of 224, 256, 384, or 512.
+#' @param y Column to compute on.
+#' @param ... additional columns.
--- End diff --

nit: capital `Columns` to indicate type


[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2017-06-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13873
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78849/
Test FAILed.


[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2017-06-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13873
  
**[Test build #78849 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78849/testReport)**
 for PR 13873 at commit 
[`306b283`](https://github.com/apache/spark/commit/306b283f457f5718a152853df3aa854f7fba8ac2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124705885
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,23 +132,40 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
-#' lit
+#' Non-aggregate functions for Column operations
 #'
-#' A new \linkS4class{Column} is created to represent the literal value.
-#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#' Non-aggregate functions defined for \code{Column}.
 #'
-#' @param x a literal value or a Column.
+#' @param x Column to compute on. In \code{lit}, it is a literal value or a Column.
+#'  In \code{monotonically_increasing_id}, it should be empty.
+#' @param y Column to compute on.
+#' @param ... additional argument(s). In \code{expr}, it contains an expression character
--- End diff --

And so in all other cases in this group, `...` is expected to be other 
columns; perhaps we can say `additional Columns`?


[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124705377
  
--- Diff: R/pkg/R/functions.R ---
@@ -824,32 +835,23 @@ setMethod("initcap",
 column(jc)
   })
 
-#' is.nan
-#'
-#' Return true if the column is NaN, alias for \link{isnan}
-#'
-#' @param x Column to compute on.
+#' @details
+#' \code{is.nan}: Alias for \link{isnan}.
--- End diff --

roxygen orders by text order, I think - doesn't that make this go first, 
before `isnan`? Perhaps we should swap the order of the code?


[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124704884
  
--- Diff: R/pkg/R/functions.R ---
@@ -3554,21 +3493,17 @@ setMethod("grouping_id",
 column(jc)
   })
 
-#' input_file_name
-#'
-#' Creates a string column with the input file name for a given row
+#' @details
+#' \code{input_file_name}: Creates a string column with the input file name for a given row.
 #'
-#' @rdname input_file_name
-#' @name input_file_name
-#' @family non-aggregate functions
-#' @aliases input_file_name,missing-method
+#' @rdname column_nonaggregate_functions
+#' @aliases input_file_name input_file_name,missing-method
 #' @export
 #' @examples
-#' \dontrun{
-#' df <- read.text("README.md")
 #'
-#' head(select(df, input_file_name()))
-#' }
+#' \dontrun{
+#' tmp <- read.text("README.md")
--- End diff --

why rename to `tmp` though?


[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124705740
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,23 +132,40 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
-#' lit
+#' Non-aggregate functions for Column operations
 #'
-#' A new \linkS4class{Column} is created to represent the literal value.
-#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#' Non-aggregate functions defined for \code{Column}.
 #'
-#' @param x a literal value or a Column.
+#' @param x Column to compute on. In \code{lit}, it is a literal value or a Column.
+#'  In \code{monotonically_increasing_id}, it should be empty.
--- End diff --

And the same for `input_file_name` - btw, "should be empty" might be a bit 
confusing? How about `In ..., should be used with no argument.`?


[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124704722
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,23 +132,40 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
-#' lit
+#' Non-aggregate functions for Column operations
 #'
-#' A new \linkS4class{Column} is created to represent the literal value.
-#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#' Non-aggregate functions defined for \code{Column}.
 #'
-#' @param x a literal value or a Column.
+#' @param x Column to compute on. In \code{lit}, it is a literal value or a Column.
+#'  In \code{monotonically_increasing_id}, it should be empty.
+#' @param y Column to compute on.
+#' @param ... additional argument(s). In \code{expr}, it contains an expression character
+#'object to be parsed.
+#' @name column_nonaggregate_functions
+#' @rdname column_nonaggregate_functions
+#' @seealso coalesce,SparkDataFrame-method
 #' @family non-aggregate functions
-#' @rdname lit
-#' @name lit
+#' @examples
+#' \dontrun{
+#' # Dataframe used throughout this doc
+#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
+NULL
+
+#' @details
+#' \code{lit}: A new \linkS4class{Column} is created to represent the literal value.
--- End diff --

this format is actually kinda weird. let's fix it? I don't think we need to link to Column
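
(A minimal sketch of the documented behavior, reusing the shared `df`:)

```
df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
# lit() wraps a literal value in a Column; a Column argument is returned unchanged
head(select(df, lit("x"), lit(df$mpg)))
```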


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124705490
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,23 +132,40 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
-#' lit
+#' Non-aggregate functions for Column operations
 #'
-#' A new \linkS4class{Column} is created to represent the literal value.
-#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#' Non-aggregate functions defined for \code{Column}.
 #'
-#' @param x a literal value or a Column.
+#' @param x Column to compute on. In \code{lit}, it is a literal value or a Column.
+#'  In \code{monotonically_increasing_id}, it should be empty.
+#' @param y Column to compute on.
+#' @param ... additional argument(s). In \code{expr}, it contains an expression character
+#'object to be parsed.
+#' @name column_nonaggregate_functions
+#' @rdname column_nonaggregate_functions
+#' @seealso coalesce,SparkDataFrame-method
 #' @family non-aggregate functions
-#' @rdname lit
-#' @name lit
+#' @examples
+#' \dontrun{
+#' # Dataframe used throughout this doc
+#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
+NULL
+
+#' @details
+#' \code{lit}: A new \linkS4class{Column} is created to represent the literal value.
+#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#'
+#' @rdname column_nonaggregate_functions
 #' @export
-#' @aliases lit,ANY-method
+#' @aliases lit lit,ANY-method
 #' @examples
+#'
 #' \dontrun{
-#' lit(df$name)
-#' select(df, lit("x"))
-#' select(df, lit("2015-01-01"))
-#'}
+#' tmp <- mutate(df, v1 = lit(df$mpg), v2 = lit("x"), v3 = lit("2015-01-01"),
+#'   v4 = negate(df$mpg), v5 = expr('length(model)'),
+#'   v6 = greatest(df$vs, df$am), v7 = least(df$vs, df$am),
+#'   v8 = column("mpg"))
--- End diff --

is there an example for
```
nanvl(df$c, x)
coalesce(df$c, df$d, df$e) 
```

that I've missed?
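
(A hedged sketch of what such examples could look like; `df2` and its columns are hypothetical here:)

```
df2 <- createDataFrame(data.frame(c = c(1, NaN), d = c(2, 3), e = c(4, 5)))
# nanvl returns the first column where it is not NaN, otherwise the second;
# coalesce returns the first non-missing value among its arguments per row
head(select(df2, nanvl(df2$c, df2$d), coalesce(df2$c, df2$d, df2$e)))
```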


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18422: [SPARK-20889][SparkR] Grouped documentation for N...

2017-06-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18422#discussion_r124705601
  
--- Diff: R/pkg/R/functions.R ---
@@ -132,23 +132,40 @@ NULL
 #' df <- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))}
 NULL
 
-#' lit
+#' Non-aggregate functions for Column operations
 #'
-#' A new \linkS4class{Column} is created to represent the literal value.
-#' If the parameter is a \linkS4class{Column}, it is returned unchanged.
+#' Non-aggregate functions defined for \code{Column}.
 #'
-#' @param x a literal value or a Column.
+#' @param x Column to compute on. In \code{lit}, it is a literal value or a Column.
+#'  In \code{monotonically_increasing_id}, it should be empty.
+#' @param y Column to compute on.
+#' @param ... additional argument(s). In \code{expr}, it contains an expression character
--- End diff --

`In \code{expr}, it contains an expression character` - this isn't quite right actually - in `expr` the expression string is passed as `x`, not as the `...` parameter
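
(To illustrate, a minimal sketch reusing the shared `df`; the SQL expression string goes in `x`:)

```
df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
# the expression string is the x argument of expr(), not part of ...
head(select(df, expr("length(model)")))
```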


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18449: [SPARK-21237][SQL] Invalidate stats once table da...

2017-06-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18449


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


