date:20180718

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21804
  
**[Test build #93236 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93236/testReport)**
 for PR 21804 at commit 
[`eb78665`](https://github.com/apache/spark/commit/eb786655387ecf7320d9b4957b45564253fb1af4).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21804
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93236/
Test FAILed.


---


[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21804
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21804
  
**[Test build #93236 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93236/testReport)**
 for PR 21804 at commit 
[`eb78665`](https://github.com/apache/spark/commit/eb786655387ecf7320d9b4957b45564253fb1af4).


---


[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21804
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21804
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1096/
Test PASSed.


---


[GitHub] spark pull request #21754: [SPARK-24705][SQL] Cannot reuse an exchange opera...

2018-07-18 Thread markhamstra

Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/21754#discussion_r203416454
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala 
---
@@ -85,14 +85,20 @@ case class ReusedExchangeExec(override val output: 
Seq[Attribute], child: Exchan
  */
 case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] {
 
+  private def supportReuseExchange(exchange: Exchange): Boolean = exchange 
match {
+// If a coordinator defined in an exchange operator, the exchange 
cannot be reused
--- End diff --

This seems overstated if this comment in the JIRA description is correct: 
"When the cache tabel device_loc is executed before this query is executed, 
everything is fine". In fact, if Xiao Li is correct in that statement, then 
this PR is eliminating a useful optimization in cases where it doesn't need to 
-- i.e. it is preventing Exchange reuse any time adaptive execution is used 
instead of only preventing reuse when it will actually cause a problem.


---


[GitHub] spark pull request #21131: [SPARK-23433][CORE] Late zombie task completions ...

2018-07-18 Thread squito

Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21131#discussion_r203415036
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -764,6 +769,19 @@ private[spark] class TaskSetManager(
 maybeFinishTaskSet()
   }
 
+  private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
+partitionToIndex.get(partitionId).foreach { index =>
+  if (!successful(index)) {
+tasksSuccessful += 1
+successful(index) = true
+if (tasksSuccessful == numTasks) {
+  isZombie = true
+}
+maybeFinishTaskSet()
--- End diff --

I think you're right, it's not needed. It's called when the tasks succeed, 
fail, or are aborted, and when this is called while that taskset still has 
running tasks, it's a no-op, as it would fail the `runningTasks == 0` check 
inside `maybeFinishTaskSet()`.

Do you think it's worth removing?  I'm fine either way.
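
To illustrate why the extra call is harmless, here is a minimal plain-Python model of the bookkeeping in the diff above. It is only a sketch: the class `TaskSetModel` and the field `finished` are hypothetical stand-ins, and the "zombie and no running tasks" condition is an assumed reading of the `runningTasks == 0` check, not a copy of `TaskSetManager`.

```python
class TaskSetModel:
    """Toy stand-in for the TaskSetManager bookkeeping (hypothetical names)."""

    def __init__(self, partitions, running_tasks=0):
        # Map each partition id to its task index, like partitionToIndex.
        self.partition_to_index = {p: i for i, p in enumerate(partitions)}
        self.num_tasks = len(partitions)
        self.successful = [False] * self.num_tasks
        self.tasks_successful = 0
        self.running_tasks = running_tasks
        self.is_zombie = False
        self.finished = False  # assumption: set once the task set is torn down

    def maybe_finish_task_set(self):
        # Assumed guard: nothing happens while tasks are still running.
        if self.is_zombie and self.running_tasks == 0:
            self.finished = True

    def mark_partition_completed(self, partition_id):
        index = self.partition_to_index.get(partition_id)
        if index is not None and not self.successful[index]:
            self.tasks_successful += 1
            self.successful[index] = True
            if self.tasks_successful == self.num_tasks:
                self.is_zombie = True
            self.maybe_finish_task_set()


ts = TaskSetModel(partitions=[0, 1], running_tasks=1)
ts.mark_partition_completed(0)
ts.mark_partition_completed(1)
print(ts.is_zombie, ts.finished)  # True False
```

In this model the task set becomes a zombie as soon as every partition is marked complete, but it is only finished once the last running task has reported back and `maybe_finish_task_set()` is called again.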


---


[GitHub] spark pull request #21804: [SPARK-24268][SQL] Use datatype.catalogString in ...

GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21804

[SPARK-24268][SQL] Use datatype.catalogString in error messages

## What changes were proposed in this pull request?

As stated in https://github.com/apache/spark/pull/21321, we should use 
`catalogString` in error messages. This is currently not the case, as 
SPARK-22893 used `simpleString` in order to have the same representation 
everywhere, and it missed some places.

The PR unifies the messages by always using the `catalogString` 
representation of the data types.

## How was this patch tested?

existing/modified UTs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-24268_catalog

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21804.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21804


commit eb786655387ecf7320d9b4957b45564253fb1af4
Author: Marco Gaido 
Date:   2018-07-18T14:47:12Z

[SPARK-24268][SQL] Use datatype.catalogString in error messages




---


[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...

2018-07-18 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21803
  
How about the case where a column name has special characters that should 
be backquoted, e.g., 'aaa:bbb'?
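
A minimal sketch of the kind of quoting this question is about, in plain Python. The helper `quote_identifier` is hypothetical and only illustrates the usual backquote convention (wrap the name in backquotes and double any embedded backquote); it is not the code in this PR.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backquotes, doubling any embedded backquote."""
    return "`" + name.replace("`", "``") + "`"


print(quote_identifier("aaa:bbb"))  # `aaa:bbb`
print(quote_identifier("a`b"))      # `a``b`
```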


---


[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...

Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21802#discussion_r203388798
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2382,6 +2382,20 @@ def array_sort(col):
 return Column(sc._jvm.functions.array_sort(_to_java_column(col)))
 
 
+@since(2.4)
+def shuffle(col):
+"""
+Collection function: Generates a random permutation of the given array.
+
+.. note:: The function is non-deterministic because its results 
depends on order of rows which
--- End diff --

Isn't it non-deterministic rather because the permutation is 
determined randomly?
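
A small plain-Python sketch of that point, using only the standard library rather than Spark: the result of a shuffle varies because the permutation is drawn from a random generator, so it is reproducible only when the generator's seed is fixed. The helper name `shuffled` and its `seed` parameter are illustrative assumptions, not part of the proposed `shuffle` function.

```python
import random


def shuffled(values, seed=None):
    """Return a random permutation of values without mutating the input."""
    rng = random.Random(seed)  # seed=None lets Python pick a fresh seed
    result = list(values)
    rng.shuffle(result)
    return result


data = [1, 9, 8, 7]
# Same seed, same permutation; without a seed the order may differ per call.
print(shuffled(data, seed=42) == shuffled(data, seed=42))  # True
print(sorted(shuffled(data)) == sorted(data))              # True
```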


---


[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...

Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21802#discussion_r203407122
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -1444,6 +1444,51 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
 )
   }
 
+  test("shuffle function") {
+// Shuffle expressions should produce same results at retries in the 
same DataFrame.
+def checkResult(df: DataFrame): Unit = {
+  checkAnswer(df, df.collect())
+}
+
+// primitive-type elements
+val idf = Seq(
+  Seq(1, 9, 8, 7),
+  Seq(5, 8, 9, 7, 2),
+  Seq.empty,
+  null
+).toDF("i")
+
+def checkResult1(): Unit = {
--- End diff --

Maybe a different name for the method?


---


[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21803
  
Can one of the admins verify this patch?


---


[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21533
  
**[Test build #4219 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4219/testReport)**
 for PR 21533 at commit 
[`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).


---


[GitHub] spark pull request #21612: [SPARK-24628][DOC]Typos of the example code in do...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21612


---


[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21803
  
**[Test build #93235 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93235/testReport)**
 for PR 21803 at commit 
[`34511db`](https://github.com/apache/spark/commit/34511db4c283e1013de203ca03ce152b26cf62f4).


---


[GitHub] spark pull request #21803: [SPARK-24849][SQL] Converting a value of StructTy...

2018-07-18 Thread MaxGekk

GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21803

[SPARK-24849][SQL] Converting a value of StructType to a DDL string

## What changes were proposed in this pull request?

In this PR, I propose to extend the `StructType` object with a new method, 
`toDDL`, which converts a value of the `StructType` type to a string formatted 
in DDL style. The resulting string can be used in table creation.

## How was this patch tested?

I added a test for the new method and 2 round-trip tests: `fromDDL` 
-> `toDDL` and `toDDL` -> `fromDDL`.
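
The two round trips can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions: a schema is modelled as a list of `(name, type)` pairs, names and types are single words that need no quoting, and `to_ddl` / `from_ddl` are hypothetical helpers rather than the methods touched by the PR.

```python
def to_ddl(schema):
    """Render [(name, type), ...] as a comma-separated DDL-style string."""
    return ",".join("{} {}".format(name, dtype) for name, dtype in schema)


def from_ddl(ddl):
    """Parse a string produced by to_ddl back into (name, type) pairs."""
    fields = [part.split() for part in ddl.split(",") if part.strip()]
    return [(name, dtype) for name, dtype in fields]


schema = [("id", "INT"), ("name", "STRING")]
ddl = to_ddl(schema)
print(ddl)                           # id INT,name STRING
print(from_ddl(ddl) == schema)       # True
print(to_ddl(from_ddl(ddl)) == ddl)  # True
```

Checking both directions matters because each one catches a different class of bug: the first shows that no schema information is lost when rendering, the second that the rendered text is stable under re-parsing.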


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 to-ddl

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21803.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21803


commit 38f905ad61f9197d12213bd93f2f755d428ee431
Author: Maxim Gekk 
Date:   2018-07-18T14:33:31Z

New method - toDDL

commit 6e0509326393ab0554b66df0ae65ba263b2c4fa9
Author: Maxim Gekk 
Date:   2018-07-18T14:39:16Z

Simplification of a test

commit 34511db4c283e1013de203ca03ce152b26cf62f4
Author: Maxim Gekk 
Date:   2018-07-18T14:44:38Z

New test for cases




---


[GitHub] spark issue #21612: [SPARK-24628][DOC]Typos of the example code in docs/mlli...

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21612
  
Merged to master


---


[GitHub] spark pull request #21767: SPARK-24804 There are duplicate words in the test...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21767


---


[GitHub] spark issue #21767: SPARK-24804 There are duplicate words in the test title ...

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21767
  
yeah, please avoid PRs that are this trivial, it's just not worth the 
overhead. But I merged it this time.
Also please read https://spark.apache.org/contributing.html


---


[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.

2018-07-18 Thread echarles

Github user echarles commented on the issue:

https://github.com/apache/spark/pull/21748
  
@mccheah I tried this PR in client mode, in-cluster, on minikube v0.25.2: 
executors are started but immediately removed. As the start/remove is so fast, 
I can hardly see any logs (and the logs I have seen don't show any stack 
trace). Maybe it is something in my env? 

The config I have for the client mode is:

```
# DRIVER_POD_NAME=$HOSTNAME
  --conf spark.kubernetes.driver.pod.name="$DRIVER_POD_NAME" \
  --conf spark.driver.host="$DRIVER_POD_NAME" \
  --conf spark.driver.port=7077 \
  --conf spark.driver.blockManager.port=1 \
```

The driver log is:

```
2018-07-18 14:29:43 INFO  SparkContext:54 - Created broadcast 0 from 
broadcast at DAGScheduler.scala:1039
2018-07-18 14:29:43 INFO  DAGScheduler:54 - Submitting 10 missing tasks 
from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
2018-07-18 14:29:43 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 
10 tasks
2018-07-18 14:29:45 INFO  ExecutorPodsAllocator:54 - Going to request 1 
executors from Kubernetes.
2018-07-18 14:29:46 INFO  BlockManagerMasterEndpoint:54 - Trying to remove 
executor 5 from BlockManagerMaster.
2018-07-18 14:29:46 INFO  BlockManagerMaster:54 - Removal of executor 5 
requested
2018-07-18 14:29:46 INFO  
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove 
non-existent executor 5
2018-07-18 14:29:52 INFO  BlockManagerMaster:54 - Removal of executor 6 
requested
2018-07-18 14:29:52 INFO  
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove 
non-existent executor 6
2018-07-18 14:29:52 INFO  BlockManagerMasterEndpoint:54 - Trying to remove 
executor 6 from BlockManagerMaster.
2018-07-18 14:29:52 INFO  ExecutorPodsAllocator:54 - Going to request 1 
executors from Kubernetes.
2018-07-18 14:29:55 INFO  BlockManagerMaster:54 - Removal of executor 7 
requested
2018-07-18 14:29:55 INFO  
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asked to remove 
non-existent executor 7
```
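
Since the executors come and go too quickly to inspect by hand, one way to get an overview is to tally the relevant driver-log lines. The following plain-Python sketch assumes the log has been saved as text with the message wording shown above; the function name `summarize_driver_log` is hypothetical, and the sample string is abbreviated from the log above.

```python
import re

# \s+ tolerates the line wraps that the mail archive inserted into the log.
REQUESTED = re.compile(r"Going to request (\d+)\s+executors from Kubernetes")
REMOVED = re.compile(r"Removal of executor (\d+)\s+requested")


def summarize_driver_log(text):
    """Count requested executors and collect ids of executors being removed."""
    requested = sum(int(n) for n in REQUESTED.findall(text))
    removed = sorted({int(i) for i in REMOVED.findall(text)})
    return requested, removed


sample = """\
2018-07-18 14:29:45 INFO  ExecutorPodsAllocator:54 - Going to request 1
executors from Kubernetes.
2018-07-18 14:29:46 INFO  BlockManagerMaster:54 - Removal of executor 5
requested
2018-07-18 14:29:52 INFO  BlockManagerMaster:54 - Removal of executor 6
requested
"""
print(summarize_driver_log(sample))  # (1, [5, 6])
```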



---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21638
  
Because this method is internal to Spark, why not just take out the 
parameter? Yes it's superfluous now, but it's been this way for a while, and 
seems perhaps better to avoid a behavior change. In fact you can pull a 
`minPartitions` parameter out of several private methods then. You can't remove 
the parameter to `binaryFiles`, sure, but it can be documented as doing nothing.


---


[GitHub] spark issue #21781: [INFRA] Close stale PR

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21781
  
I would add:

https://github.com/apache/spark/pull/19233
https://github.com/apache/spark/pull/20100



---


[GitHub] spark pull request #21765: [MINOR][CORE] Add test cases for RDD.cartesian

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21765


---


[GitHub] spark issue #21765: [MINOR][CORE] Add test cases for RDD.cartesian

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21765
  
Merged to master


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21774
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21774
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93230/
Test PASSed.


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21774
  
**[Test build #93230 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93230/testReport)**
 for PR 21774 at commit 
[`204a59d`](https://github.com/apache/spark/commit/204a59d0088b9a3c959c6e3bce6b2fd663d991be).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AvroFunctionsSuite extends QueryTest with SharedSQLContext `


---


[GitHub] spark pull request #21795: [SPARK-24840][SQL] do not use dummy filter to swi...

Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21795#discussion_r203379244
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -1147,65 +1149,66 @@ class DataFrameFunctionsSuite extends QueryTest 
with SharedSQLContext {
 val nseqi : Seq[Int] = null
 val nseqs : Seq[String] = null
 val df = Seq(
-
   (Seq(1), Seq(2, 3), Seq(5L, 6L), nseqi, Seq("a", "b", "c"), Seq("d", 
"e"), Seq("f"), nseqs),
   (Seq(1, 0), Seq.empty[Int], Seq(2L), nseqi, Seq("a"), 
Seq.empty[String], Seq(null), nseqs)
 ).toDF("i1", "i2", "i3", "in", "s1", "s2", "s3", "sn")
 
-val dummyFilter = (c: Column) => c.isNull || c.isNotNull // switch 
codeGen on
-
 // Simple test cases
-checkAnswer(
--- End diff --

Good catch!


---


[GitHub] spark pull request #21795: [SPARK-24840][SQL] do not use dummy filter to swi...

Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21795#discussion_r203378508
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -924,26 +926,26 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
   null
 ).toDF("i")
 
-checkAnswer(
-  idf.select(reverse('i)),
-  Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), 
Row(null))
-)
-checkAnswer(
-  idf.filter(dummyFilter('i)).select(reverse('i)),
-  Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), 
Row(null))
-)
-checkAnswer(
-  idf.selectExpr("reverse(i)"),
-  Seq(Row(Seq(7, 8, 9, 1)), Row(Seq(2, 7, 9, 8, 5)), Row(Seq.empty), 
Row(null))
-)
-checkAnswer(
-  oneRowDF.selectExpr("reverse(array(1, null, 2, null))"),
-  Seq(Row(Seq(null, 2, null, 1)))
-)
-checkAnswer(
-  oneRowDF.filter(dummyFilter('i)).selectExpr("reverse(array(1, null, 
2, null))"),
-  Seq(Row(Seq(null, 2, null, 1)))
-)
+def checkResult2(): Unit = {
--- End diff --

What about using more specific names for the functions `checkResult2`, 
`checkResult3`, etc.? Maybe `checkStringTestCases`, 
`checkCasesWithArraysOfComplexTypes` or something like that? 


---


[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20856
  
@HyukjinKwon thanks for your great analysis. I agree with you that the 
proposed fix is more of a "workaround" than a real fix for the issue we have 
here.

The main problem here, as you pointed out, is that we have a bad (invalid?) 
`FileSourceScanExec` on the executors. Probably this has never been an issue 
because on the executors we accessed only some properties which were correctly 
populated, and we assumed that the other operations would be performed only on 
the driver side.

I think the cleanest approach (not sure it is entirely feasible) would be 
to choose one of the following options:

 - check that all exec expressions (in this case `FileSourceScanExec`) work 
properly on both the driver and executor side;
 - define which operations/attributes can also be accessed on the executor 
side and which only on the driver side, document it and enforce it (if feasible).

What do you think?
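
The second option could look roughly like the following plain-Python sketch. Everything here is hypothetical: the `driver_only` decorator, the `on_driver` flag and the `ScanNode` class are invented for illustration and do not correspond to Spark APIs. The point is only that driver-only operations fail loudly on an executor instead of silently working with invalid state.

```python
import functools


def driver_only(method):
    """Mark a method as usable on the driver side only (hypothetical guard)."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.on_driver:
            raise RuntimeError(
                "{} may only be called on the driver".format(method.__name__)
            )
        return method(self, *args, **kwargs)
    return wrapper


class ScanNode:
    """Toy plan node; on_driver says where this copy of the node lives."""

    def __init__(self, output, on_driver):
        self.output = output        # safe to read on both sides
        self.on_driver = on_driver

    @driver_only
    def plan_partitions(self):
        return ["partition-for-" + column for column in self.output]


print(ScanNode(["a", "b"], on_driver=True).plan_partitions())
try:
    ScanNode(["a", "b"], on_driver=False).plan_partitions()
except RuntimeError as error:
    print(error)  # plan_partitions may only be called on the driver
```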


---


[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-18 Thread squito

Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203381863
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks: 
Array[ByteBuffer]) {
 
 }
 
+object ChunkedByteBuffer {
+  // TODO eliminate this method if we switch BlockManager to getting 
InputStreams
+  def fromManagedBuffer(data: ManagedBuffer, maxChunkSize: Int): 
ChunkedByteBuffer = {
+data match {
+  case f: FileSegmentManagedBuffer =>
+map(f.getFile, maxChunkSize, f.getOffset, f.getLength)
+  case other =>
+new ChunkedByteBuffer(other.nioByteBuffer())
+}
+  }
+
+  def map(file: File, maxChunkSize: Int, offset: Long, length: Long): 
ChunkedByteBuffer = {
+Utils.tryWithResource(new FileInputStream(file).getChannel()) { 
channel =>
--- End diff --

great, thanks for the explanation


---


[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21440
  
**[Test build #93234 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93234/testReport)**
 for PR 21440 at commit 
[`4664942`](https://github.com/apache/spark/commit/4664942f0509b8d34ff27ddc9427351ed836f663).


---


[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21440
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21440
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1095/
Test PASSed.


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20949
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20949
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93229/
Test PASSed.


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20949
  
**[Test build #93229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93229/testReport)**
 for PR 20949 at commit 
[`025958a`](https://github.com/apache/spark/commit/025958a7d9e8a741875db2af8878f60cb07409d3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21770
  
**[Test build #93233 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93233/testReport)**
 for PR 21770 at commit 
[`5d33d53`](https://github.com/apache/spark/commit/5d33d535f7c04a7231c3b088ac3fcde313f5da8c).


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21770
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21770
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1094/
Test PASSed.


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20949
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93226/
Test FAILed.


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20949
  
Build finished. Test FAILed.


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

2018-07-18 Thread gengliangwang

Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/21774
  
This is ready for review. @cloud-fan 


---


[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20949
  
**[Test build #93226 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93226/testReport)**
 for PR 20949 at commit 
[`fd857b0`](https://github.com/apache/spark/commit/fd857b005abba233eb7409479436c0abe4e23e4f).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---


[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21596
  
**[Test build #93232 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93232/testReport)**
 for PR 21596 at commit 
[`7d4ac0b`](https://github.com/apache/spark/commit/7d4ac0b25ca0b38e48e20f288e7389fbbf83a01a).


---


[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21802
  
**[Test build #93231 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93231/testReport)**
 for PR 21802 at commit 
[`b4cbb55`](https://github.com/apache/spark/commit/b4cbb5558088356fe6be1cda053c9f91fbe7c538).


---


[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21802
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1093/
Test PASSed.


---


[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21802
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21386: [SPARK-23928][SQL][WIP] Add shuffle collection function.

2018-07-18 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21386
  
@pkuwm I submitted a PR #21802 based on this. Could you take a look if you 
have time? Thanks!


---


[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-18 Thread liutang123

Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r203365167
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 
 writeLong(array.length)
 writeLongArray(writeBuffer, array, array.length)
-val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-writeLong(used)
+val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+writeLong(cursorFlag)
+val used = (cursorFlag / 8).toInt
--- End diff --


![image](https://issues.apache.org/jira/secure/attachment/12932027/Spark%20LongHashedRelation%20serialization.svg)



---


[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.

2018-07-18 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21802
  
cc @pkuwm


---


[GitHub] spark pull request #21802: [SPARK-23928][SQL] Add shuffle collection functio...

2018-07-18 Thread ueshin

GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/21802

[SPARK-23928][SQL] Add shuffle collection function.

## What changes were proposed in this pull request?

This PR adds a new collection function: shuffle. It generates a random 
permutation of the given array. 
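
For illustration only: a uniformly random permutation can be produced with a Fisher-Yates shuffle. The sketch below is standalone Scala and shows one possible approach under that assumption; it is not the code in this PR, and `ShuffleSketch` and its signature are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ShuffleSketch {
  // Returns a random permutation of `input` (Fisher-Yates).
  // The input sequence itself is not modified.
  def shuffle[T](input: Seq[T], rng: Random): Vector[T] = {
    val out = ArrayBuffer.empty[T]
    out ++= input
    var i = out.length - 1
    while (i > 0) {
      val j = rng.nextInt(i + 1) // pick j with 0 <= j <= i
      val tmp = out(i)
      out(i) = out(j)
      out(j) = tmp
      i -= 1
    }
    out.toVector
  }
}
```

Passing the random generator in explicitly keeps the sketch reproducible in tests; the result always contains the same elements as the input, only the order changes.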

## How was this patch tested?

New tests are added to CollectionExpressionsSuite.scala and 
DataFrameFunctionsSuite.scala.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-23928/shuffle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21802.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21802


commit a3dbd93c0acbb2a3f3fb50574ae1e126c66c4d2d
Author: pkuwm 
Date:   2018-07-17T23:18:03Z

Add shuffle collection function.

commit b4cbb5558088356fe6be1cda053c9f91fbe7c538
Author: Takuya UESHIN 
Date:   2018-07-18T12:17:59Z

Refactor Shuffle function.




---


[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...

2018-07-18 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20856
  
Okay, I investigated this, and the fix itself looks quite inappropriate.

This looks like what is happening. I can reproduce it in a slightly messy way:

```diff
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
index 8d06804ce1e..d25fc9a7ba9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
@@ -37,7 +37,9 @@ class EquivalentExpressions {
   case _ => false
 }

-override def hashCode: Int = e.semanticHash()
+override def hashCode: Int = {
+  1
+}
   }
```

```scala
spark.range(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").createOrReplaceTempView("foo")
spark.conf.set("spark.sql.codegen.wholeStage", false)
sql("SELECT (SELECT id FROM foo) == (SELECT id FROM foo)").collect()
```
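
As a side note on why forcing a constant `hashCode` helps reproduce this: when every key hashes to the same value, a hash map has to fall back to `equals` to tell the keys apart. A minimal standalone sketch of that effect, using plain Scala collections and no Spark (`Wrapper` and `equalsCallsFor` are hypothetical names, not Spark classes):

```scala
import scala.collection.mutable

object HashCollisionSketch {
  // Counts how often equals() runs on the wrapper below.
  var equalsCalls = 0

  // Hypothetical stand-in for a wrapper whose hashCode can be forced to a constant.
  final class Wrapper(val name: String, constantHash: Boolean) {
    override def equals(other: Any): Boolean = other match {
      case w: Wrapper => equalsCalls += 1; name == w.name
      case _ => false
    }
    override def hashCode: Int = if (constantHash) 1 else name.hashCode
  }

  // Inserts the names into a hash map and reports how many equals() calls happened.
  def equalsCallsFor(names: Seq[String], constantHash: Boolean): Int = {
    equalsCalls = 0
    val seen = mutable.HashMap.empty[Wrapper, Int]
    names.foreach(n => seen.put(new Wrapper(n, constantHash), 1))
    equalsCalls
  }
}
```

With `constantHash = true`, inserting two distinct names should trigger at least one `equals` call, which is the code path the patched `hashCode` is meant to exercise.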

This is what I see and think:

1. A scalar subquery is created (for instance `SELECT (SELECT id FROM foo)`).

2. Spark tries to extract common subexpressions (via 
`CodeGenerator.subexpressionElimination`) so that it can generate shared 
code that can be reused.

3. During this, it seems to collect the expressions that can be reused (via 
`EquivalentExpressions.addExprTree`).

  
https://github.com/apache/spark/blob/b2deef64f604ddd9502a31105ed47cb63470ec85/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1102

4. During this, if the hashes (`EquivalentExpressions.Expr.hashCode`) 
happen to collide in `EquivalentExpressions.addExpr`, 
`EquivalentExpressions.Expr.equals` is called to identify the object within 
the same hash bucket, which eventually calls `semanticEquals` in `ScalarSubquery`.

  
https://github.com/apache/spark/blob/087879a77acb37b790c36f8da67355b90719c2dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L54

  
https://github.com/apache/spark/blob/087879a77acb37b790c36f8da67355b90719c2dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L36

5. `ScalarSubquery`'s `semanticEquals` needs `SubqueryExec`'s `sameResult`

  
https://github.com/apache/spark/blob/77a2fc5b521788b406bb32bcc3c637c1d7406e58/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala#L58

6. `SubqueryExec`'s `sameResult` requires a canonicalized plan which calls 
`FileSourceScanExec`'s `doCanonicalize`

  
https://github.com/apache/spark/blob/e008ad175256a3192fdcbd2c4793044d52f46d57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L258

7. In `FileSourceScanExec`'s `doCanonicalize`, `FileSourceScanExec`'s 
`relation` is required, but it seems to be `@transient`, so it becomes `null`.

  
https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L527

  
https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160

8. NPE is thrown:

```
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:169)
at 
org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:526)
at 
org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:159)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:211)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$3.apply(QueryPlan.scala:225)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$3.apply(QueryPlan.scala:225)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:225)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:211)
at

[GitHub] spark issue #21596: [SPARK-24601] Bump Jackson version

2018-07-18 Thread Fokko

Github user Fokko commented on the issue:

https://github.com/apache/spark/pull/21596
  
Rebased onto master


---


[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21795
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93225/
Test PASSed.


---


[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21795
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21795
  
**[Test build #93225 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93225/testReport)**
 for PR 21795 at commit 
[`de5a232`](https://github.com/apache/spark/commit/de5a2323b5b46a4c073e3ff1dce6daea395dd1dd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21589
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21589
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93223/
Test PASSed.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21589
  
**[Test build #93223 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93223/testReport)**
 for PR 21589 at commit 
[`eebb310`](https://github.com/apache/spark/commit/eebb31099f078cc05bf0f6d6e32c94d4ee818f9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-07-18 Thread stanzhai

Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
That's not reasonable: `failFunctionLookup` throws `NoSuchFunctionException`.
The function actually exists in the currently selected database, so we should 
throw an exception that reflects the initialization failure, not 
`NoSuchFunctionException`.


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93224/
Test PASSed.


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21469
  
**[Test build #93224 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93224/testReport)**
 for PR 21469 at commit 
[`5b203d4`](https://github.com/apache/spark/commit/5b203d4967eda3a09f7c8d83cf86e7ac6a427182).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread MaxGekk

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21589

> User's are not expected to override it unless they want fine grained
control over the value

This is actually one of the use cases where a user needs to take control or
tune a query. The `defaultParallelism` is used in many places, such as
https://github.com/apache/spark/blob/9549a2814951f9ba969955d78ac4bd2240f85989/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L594-L597
. If users want to tune the behavior of those methods, they have to change
`defaultParallelism`. The factor `5` in
`df.repartition(5 * sc.defaultParallelism)` then has to be tuned accordingly.
In this way we just force users to introduce absolutely unnecessary complexity
and dependencies into their code. If I need the number of cores in my cluster,
I would like a direct way to get it instead of hoping that a method implicitly
returns this number.

> One thing to be kept in mind is that dynamic resource allocation will
kick in after tasks are submitted ...

Let me show you another use case that I have observed in practice. Our
customers write code in notebooks and can attach their notebooks to
different clusters. Usually code is developed and debugged on a small (staging)
cluster. After that the notebooks are re-attached to a production cluster,
which may have a completely different size. Pretty often users just leave
existing params/constants, like the one in `repartition()`, as is. This usually
leads to underloading or overloading the cluster. Why can't they use
`defaultParallelism` everywhere? Look at the use case above: tuning one part
of a user's app requires changing factors in other parts (absolutely
independent from the first one).

---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21733
  
Build finished. Test FAILed.


---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21733
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21733
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93222/
Test PASSed.


---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21733
  
**[Test build #93222 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93222/testReport)**
 for PR 21733 at commit 
[`4754469`](https://github.com/apache/spark/commit/4754469ebdb36da1d3ae1234a49472716a143119).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  sealed trait StreamingAggregationStateManager extends Serializable `
  * `  abstract class StreamingAggregationStateManagerBaseImpl(`
  * `  class StreamingAggregationStateManagerImplV1(`
  * `  class StreamingAggregationStateManagerImplV2(`


---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21733
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93221/
Test FAILed.


---


[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21733
  
**[Test build #93221 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93221/testReport)**
 for PR 21733 at commit 
[`db9d9ce`](https://github.com/apache/spark/commit/db9d9ce6dc4912672ca0af14833b5d0c239f9562).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `  sealed trait StreamingAggregationStateManager extends Serializable `
  * `  abstract class StreamingAggregationStateManagerBaseImpl(`
  * `  class StreamingAggregationStateManagerImplV1(`
  * `  class StreamingAggregationStateManagerImplV2(`


---


[GitHub] spark issue #21752: [SPARK-24788][SQL] fixed UnresolvedException when toStri...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21752
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix l...

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21801


---


[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21764
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93220/
Test PASSed.


---


[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21764
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21764
  
**[Test build #93220 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93220/testReport)**
 for PR 21764 at commit 
[`84f1a6b`](https://github.com/apache/spark/commit/84f1a6b5cba08df8684179e9d7195545be655e76).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

2018-07-18 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21801
  
Merged to master.


---


[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21801
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21801
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93218/
Test PASSed.


---


[GitHub] spark issue #21801: [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-jav...

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21801
  
**[Test build #93218 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93218/testReport)**
 for PR 21801 at commit 
[`7f78d75`](https://github.com/apache/spark/commit/7f78d750411a4098527b2b332495f5dd4f20c63e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21589
  
+CC @markhamstra since you were looking at API stability.


---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21589
  

I am not convinced by the rationale given in the jira for adding the new 
APIs.
The examples given there can be easily modeled using `defaultParallelism` 
(to get current state) and executor events (to get numCores, memory per 
executor).
For example: `df.repartition(5 * sc.defaultParallelism)`

The other argument seems to be that users can override this value and set 
it to a static constant.
Users are not expected to override it unless they want fine grained 
control over the value and spark is expected to honor it when specified.

One thing to be kept in mind is that dynamic resource allocation will kick 
in after tasks are submitted (when there are insufficient resources available) 
- so trying to fine tune this for an application, in presence of DRA, using 
these APIs is not going to be effective anyway.

If there are corner cases where `defaultParallelism` is not accurate, we 
should fix those to reflect the current value.


---


[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r203322643
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
* to a new position (in the new data array).
*/
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, 
Int) => Unit) {
-if (_size > _growThreshold) {
+if (_occupied > _growThreshold) {
--- End diff --

For accuracy's sake: my example snippet above will fail much earlier, due 
to `OpenHashSet.MAX_CAPACITY`. Though that is probably not the point anyway :-)


---


[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r203322056
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
* to a new position (in the new data array).
*/
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, 
Int) => Unit) {
-if (_size > _growThreshold) {
+if (_occupied > _growThreshold) {
--- End diff --

There is no explicit entry here - it is simply unoccupied slots in an 
array.
The slot is free, it can be used by some other (new) entry when insert is 
called.

It should be easy to see how very bad behavior can happen even when the actual 
size of the set is very small: a series of adds/removes results in unending 
growth of the set.

Something like this, for example, is enough to cause the set to blow up to 2B 
entries:
```
var i = 0
while (i < Int.MaxValue) {
  set.add(1)
  set.remove(1)
  assert (0 == set.size)
  i += 1
}
```
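
To make the concern concrete without depending on the patched class, here is a toy model (an assumption for illustration, not Spark's `OpenHashSet` or the code in this PR): an open-addressing set whose growth check looks at a counter that `remove` never decrements. `ToySet`, its fields and its load factor are all hypothetical.

```scala
object OccupiedCounterSketch {
  private val Empty: Byte = 0
  private val Full: Byte = 1
  private val Deleted: Byte = 2

  // Toy open-addressing Int set with linear probing and tombstones.
  final class ToySet(initialCapacity: Int = 8, loadFactor: Double = 0.7) {
    private var keys = new Array[Int](initialCapacity)
    private var state = new Array[Byte](initialCapacity)
    private var live = 0     // entries currently in the set
    private var occupied = 0 // bumped on every insert, never decremented by remove

    def size: Int = live
    def capacity: Int = keys.length

    // Index holding `k`, or -1 if `k` is absent.
    private def find(k: Int): Int = {
      var pos = java.lang.Math.floorMod(k, keys.length)
      while (state(pos) != Empty) {
        if (state(pos) == Full && keys(pos) == k) return pos
        pos = (pos + 1) % keys.length
      }
      -1
    }

    // First reusable slot (Deleted or Empty) on the probe path of `k`.
    private def freeSlot(k: Int): Int = {
      var pos = java.lang.Math.floorMod(k, keys.length)
      while (state(pos) == Full) pos = (pos + 1) % keys.length
      pos
    }

    def add(k: Int): Unit = if (find(k) < 0) {
      val pos = freeSlot(k)
      keys(pos) = k
      state(pos) = Full
      live += 1
      occupied += 1
      // Growth is driven by `occupied`, not by the live size.
      if (occupied > loadFactor * keys.length) grow()
    }

    def remove(k: Int): Unit = {
      val pos = find(k)
      if (pos >= 0) { state(pos) = Deleted; live -= 1 } // `occupied` is left untouched
    }

    // Doubles the table and re-inserts only the live entries.
    private def grow(): Unit = {
      val oldKeys = keys
      val oldState = state
      keys = new Array[Int](oldKeys.length * 2)
      state = new Array[Byte](oldKeys.length * 2)
      occupied = live
      for (i <- oldKeys.indices if oldState(i) == Full) {
        val pos = freeSlot(oldKeys(i))
        keys(pos) = oldKeys(i)
        state(pos) = Full
      }
    }
  }
}
```

Running the add/remove loop against this toy keeps `size` at 0 while `capacity` keeps doubling, even though the sketch resets the counter on every rehash; the growth is slower than with a counter that is never reset, but still unbounded.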



---


[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread MaxGekk

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21589
  
> I am not seeing the utility of these two methods.

@mridulm I describe the utility of the methods in the ticket: 
https://issues.apache.org/jira/browse/SPARK-24591

> defaultParallelism already captures the current number of cores.

The `defaultParallelism` can be changed by users. And pretty often it does 
not reflect the number of cores. 

> For monitoring usecases, existing events fired via listener can be used 
to keep track of current executor population (if that is the intended usecase).

The basic cluster properties should be easily discoverable via APIs, I 
believe. And monitoring is just one of the use cases. 


---


[GitHub] spark pull request #21789: [SPARK-24829][SQL]In Spark Thrift Server, CAST AS...

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21789#discussion_r203320896
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -766,6 +774,14 @@ class HiveThriftHttpServerSuite extends 
HiveThriftJdbcTest {
   assert(resultSet.getString(2) === HiveUtils.builtinHiveVersion)
 }
   }
+
+  test("Checks cast as float") {
--- End diff --

then probably better to add it into HiveThriftJdbcTest?


---


[GitHub] spark pull request #21789: [SPARK-24829][SQL]In Spark Thrift Server, CAST AS...

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21789#discussion_r203321155
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/Column.java ---
@@ -349,7 +349,7 @@ public void addValue(Type type, Object field) {
 break;
   case FLOAT_TYPE:
 nulls.set(size, field == null);
-doubleVars()[size] = field == null ? 0 : 
((Float)field).doubleValue();
+doubleVars()[size] = field == null ? 0 : new 
Double(field.toString());
--- End diff --

If the problem is the precision, isn't it enough to cast it to Double 
instead of creating a double out of a string?
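
For reference, a small standalone sketch of the difference between the two conversions in the diff, assuming the concern is the extra decimal digits that show up when a float is widened (`FloatWideningSketch` is a hypothetical name):

```scala
object FloatWideningSketch {
  // Widening keeps the float's exact binary value, so a value such as 0.1f
  // does not become the double 0.1.
  def widen(f: Float): Double = f.toDouble

  // Going through the float's decimal string parses the short decimal form
  // as a double instead.
  def viaString(f: Float): Double = java.lang.Double.parseDouble(f.toString)
}
```

Under that assumption, a plain cast behaves like `widen` here and would not change the result, whereas the string round trip does.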


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode

Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21758
  
I had left a few comments on SPARK-24375 @jiangxb1987 ... unfortunately the 
jiras have moved around a bit.
If this is the active PR for introducing the feature, it would be great to 
get clarity on them.


---


[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r203319952
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class 
SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, 
accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
 execId: String,
-accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the 
history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by 
MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+execId: String,
+stageId: Int,
+stageAttemptId: Int,
+executorMetrics: Array[Long])
--- End diff --

+1 on enums @squito!
The only concern would be evolving the enums in a later release - changing 
an enum could result in source incompatibility.


---


[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r203319710
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
* to a new position (in the new data array).
*/
   def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
-if (_size > _growThreshold) {
+if (_occupied > _growThreshold) {
--- End diff --

When 'remove' is called, '_size' is decremented, but the entry's slot is not 
released. This is the motivation for introducing '_occupied'.
I will try another implementation without 'remove', although it may 
introduce some overhead.
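
The difference between the two counters can be sketched with a toy open-addressing set. This is a simplified illustration under my own assumptions, not Spark's OpenHashSet: the class `TinyOpenSet`, its slot states and its linear probing are all hypothetical.

```scala
// Hypothetical, simplified open-addressing set of Ints; NOT Spark's OpenHashSet.
class TinyOpenSet(initialCapacity: Int = 8, loadFactor: Double = 0.5) {
  private val Empty = 0   // slot never used
  private val Live = 1    // slot holds a key
  private val Removed = 2 // slot held a key that was removed (tombstone)

  private var capacity = initialCapacity
  private var keys = new Array[Int](capacity)
  private var state = new Array[Int](capacity)
  private var _size = 0     // live keys only
  private var _occupied = 0 // live keys plus tombstones

  def size: Int = _size
  def occupied: Int = _occupied
  def currentCapacity: Int = capacity

  private def growThreshold: Int = (loadFactor * capacity).toInt

  // Index of the live slot holding k, or -1. Probing stops only at an Empty slot.
  private def find(k: Int): Int = {
    var pos = (k & Int.MaxValue) % capacity
    while (state(pos) != Empty) {
      if (state(pos) == Live && keys(pos) == k) return pos
      pos = (pos + 1) % capacity
    }
    -1
  }

  // Store k in the first Empty slot of its probe sequence (tombstones are not reused).
  private def place(k: Int): Unit = {
    var pos = (k & Int.MaxValue) % capacity
    while (state(pos) != Empty) pos = (pos + 1) % capacity
    keys(pos) = k
    state(pos) = Live
  }

  def contains(k: Int): Boolean = find(k) >= 0

  def add(k: Int): Unit = if (find(k) < 0) {
    place(k)
    _size += 1
    _occupied += 1
    // Comparing _size here would ignore tombstones and could let the table fill up.
    if (_occupied > growThreshold) rehash()
  }

  def remove(k: Int): Boolean = {
    val pos = find(k)
    if (pos < 0) false
    else { state(pos) = Removed; _size -= 1; true } // the slot stays occupied
  }

  // Double the table and reinsert live keys only, which drops all tombstones.
  private def rehash(): Unit = {
    val oldKeys = keys
    val oldState = state
    capacity *= 2
    keys = new Array[Int](capacity)
    state = new Array[Int](capacity)
    for (i <- oldKeys.indices if oldState(i) == Live) place(oldKeys(i))
    _occupied = _size
  }
}
```

In this sketch, after adding four keys to a table of capacity 8 and removing three of them, `size` is 1 but `occupied` is still 4. The next insert grows the table only because the check uses `_occupied`; with `_size`, tombstones would keep accumulating until no empty slot is left to terminate a probe.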


---


[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...

Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/21729
  
Looks good to me, thanks for fixing this, @hthuynh2!


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21774
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21652
  
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/1091/



---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21774
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1092/
Test PASSed.


---


[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21652
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1091/
Test PASSed.


---


[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21652
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets