[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-05 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21939
  
> i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :)

Thanks @shaneknapp!  So for those suggesting we keep the existing minimum pyarrow version of 0.8.0, does that mean we will need to add tests for 0.9.0 and 0.10.0 as well, i.e. three versions in total?


---




[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-05 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21939
  
@gatorsmile, there is an RC1 vote up now, so it should be very soon


---




[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-05 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21939
  
@cloud-fan, we have BinaryType support in Java already, but it has not been added to Python due to an issue (the related JIRAs that @HyukjinKwon mentioned). Arrow 0.10.0 has a bug fix that makes it possible to add it to Python.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21608
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94259/
Test PASSed.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #94259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94259/testReport)** for PR 21608 at commit [`253af70`](https://github.com/apache/spark/commit/253af70cd4cd525db4951db67fce01a5ca1f0014).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21998
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21622: [SPARK-24637][SS] Add metrics regarding state and waterm...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21622
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94261/
Test PASSed.


---




[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21998
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94258/
Test PASSed.


---




[GitHub] spark issue #21622: [SPARK-24637][SS] Add metrics regarding state and waterm...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21622
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21998
  
**[Test build #94258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94258/testReport)** for PR 21998 at commit [`8f0c578`](https://github.com/apache/spark/commit/8f0c57804e75b2a74f7573a21f6c7c63f7b85e03).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21622: [SPARK-24637][SS] Add metrics regarding state and waterm...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21622
  
**[Test build #94261 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94261/testReport)** for PR 21622 at commit [`722e6a0`](https://github.com/apache/spark/commit/722e6a0f7506440f260126d841d0cb27cf744100).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94256/
Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #94256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94256/testReport)** for PR 21305 at commit [`ae0d242`](https://github.com/apache/spark/commit/ae0d2424315634760d46be0f21e0e98160bada5a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21937: [WIP][SPARK-23914][SQL][follow-up] refactor ArrayUnion

2018-08-05 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21937
  
I'll revisit after the conflict is fixed.


---




[GitHub] spark pull request #21937: [WIP][SPARK-23914][SQL][follow-up] refactor Array...

2018-08-05 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21937#discussion_r207767260
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3698,230 +3767,162 @@ object ArraySetLike {
   """,
   since = "2.4.0")
 case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike
-    with ComplexTypeMergingExpression {
-  var hsInt: OpenHashSet[Int] = _
-  var hsLong: OpenHashSet[Long] = _
-
-  def assignInt(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
-    val elem = array.getInt(idx)
-    if (!hsInt.contains(elem)) {
-      if (resultArray != null) {
-        resultArray.setInt(pos, elem)
-      }
-      hsInt.add(elem)
-      true
-    } else {
-      false
-    }
-  }
-
-  def assignLong(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
-    val elem = array.getLong(idx)
-    if (!hsLong.contains(elem)) {
-      if (resultArray != null) {
-        resultArray.setLong(pos, elem)
-      }
-      hsLong.add(elem)
-      true
-    } else {
-      false
-    }
-  }
+  with ComplexTypeMergingExpression {
 
-  def evalIntLongPrimitiveType(
-      array1: ArrayData,
-      array2: ArrayData,
-      resultArray: ArrayData,
-      isLongType: Boolean): Int = {
-    // store elements into resultArray
-    var nullElementSize = 0
-    var pos = 0
-    Seq(array1, array2).foreach { array =>
-      var i = 0
-      while (i < array.numElements()) {
-        val size = if (!isLongType) hsInt.size else hsLong.size
-        if (size + nullElementSize > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-          ArraySetLike.throwUnionLengthOverflowException(size)
-        }
-        if (array.isNullAt(i)) {
-          if (nullElementSize == 0) {
-            if (resultArray != null) {
-              resultArray.setNullAt(pos)
+  @transient lazy val evalUnion: (ArrayData, ArrayData) => ArrayData = {
+    if (elementTypeSupportEquals) {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        val hs = new OpenHashSet[Any]
+        var foundNullElement = false
+        Seq(array1, array2).foreach { array =>
+          var i = 0
+          while (i < array.numElements()) {
+            if (array.isNullAt(i)) {
+              if (!foundNullElement) {
+                arrayBuffer += null
+                foundNullElement = true
+              }
+            } else {
+              val elem = array.get(i, elementType)
+              if (!hs.contains(elem)) {
+                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+                  ArraySetLike.throwUnionLengthOverflowException(arrayBuffer.size)
+                }
+                arrayBuffer += elem
+                hs.add(elem)
+              }
             }
-            pos += 1
-            nullElementSize = 1
+            i += 1
           }
-        } else {
-          val assigned = if (!isLongType) {
-            assignInt(array, i, resultArray, pos)
+        }
+        new GenericArrayData(arrayBuffer)
+    } else {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        var alreadyIncludeNull = false
+        Seq(array1, array2).foreach(_.foreach(elementType, (_, elem) => {
+          var found = false
+          if (elem == null) {
+            if (alreadyIncludeNull) {
+              found = true
+            } else {
+              alreadyIncludeNull = true
+            }
           } else {
-            assignLong(array, i, resultArray, pos)
+            // check elem is already stored in arrayBuffer or not?
+            var j = 0
+            while (!found && j < arrayBuffer.size) {
+              val va = arrayBuffer(j)
+              if (va != null && ordering.equiv(va, elem)) {
+                found = true
+              }
+              j = j + 1
+            }
           }
-          if (assigned) {
-            pos += 1
+          if (!found) {
+            if (arrayBuffer.length > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+              ArraySetLike.throwUnionLengthOverflowException(arrayBuffer.length)
+            }
+            arrayBuffer += elem
           }
-        }
-        i += 1
-      }
+        }))
+        new GenericArrayData(arrayBuffer)
     }
-    pos
   }
 
   override def nullSafeEval(input1: Any, input2: 

[GitHub] spark pull request #21937: [WIP][SPARK-23914][SQL][follow-up] refactor Array...

2018-08-05 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21937#discussion_r207767113
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3698,230 +3767,162 @@ object ArraySetLike {
   """,
   since = "2.4.0")
 case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike
-    with ComplexTypeMergingExpression {
-  var hsInt: OpenHashSet[Int] = _
-  var hsLong: OpenHashSet[Long] = _
-
-  def assignInt(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
-    val elem = array.getInt(idx)
-    if (!hsInt.contains(elem)) {
-      if (resultArray != null) {
-        resultArray.setInt(pos, elem)
-      }
-      hsInt.add(elem)
-      true
-    } else {
-      false
-    }
-  }
-
-  def assignLong(array: ArrayData, idx: Int, resultArray: ArrayData, pos: Int): Boolean = {
-    val elem = array.getLong(idx)
-    if (!hsLong.contains(elem)) {
-      if (resultArray != null) {
-        resultArray.setLong(pos, elem)
-      }
-      hsLong.add(elem)
-      true
-    } else {
-      false
-    }
-  }
+  with ComplexTypeMergingExpression {
 
-  def evalIntLongPrimitiveType(
-      array1: ArrayData,
-      array2: ArrayData,
-      resultArray: ArrayData,
-      isLongType: Boolean): Int = {
-    // store elements into resultArray
-    var nullElementSize = 0
-    var pos = 0
-    Seq(array1, array2).foreach { array =>
-      var i = 0
-      while (i < array.numElements()) {
-        val size = if (!isLongType) hsInt.size else hsLong.size
-        if (size + nullElementSize > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-          ArraySetLike.throwUnionLengthOverflowException(size)
-        }
-        if (array.isNullAt(i)) {
-          if (nullElementSize == 0) {
-            if (resultArray != null) {
-              resultArray.setNullAt(pos)
+  @transient lazy val evalUnion: (ArrayData, ArrayData) => ArrayData = {
+    if (elementTypeSupportEquals) {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        val hs = new OpenHashSet[Any]
+        var foundNullElement = false
+        Seq(array1, array2).foreach { array =>
+          var i = 0
+          while (i < array.numElements()) {
+            if (array.isNullAt(i)) {
+              if (!foundNullElement) {
+                arrayBuffer += null
+                foundNullElement = true
+              }
+            } else {
+              val elem = array.get(i, elementType)
+              if (!hs.contains(elem)) {
+                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+                  ArraySetLike.throwUnionLengthOverflowException(arrayBuffer.size)
+                }
+                arrayBuffer += elem
+                hs.add(elem)
+              }
             }
-            pos += 1
-            nullElementSize = 1
+            i += 1
           }
-        } else {
-          val assigned = if (!isLongType) {
-            assignInt(array, i, resultArray, pos)
+        }
+        new GenericArrayData(arrayBuffer)
+    } else {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        var alreadyIncludeNull = false
+        Seq(array1, array2).foreach(_.foreach(elementType, (_, elem) => {
+          var found = false
+          if (elem == null) {
+            if (alreadyIncludeNull) {
+              found = true
+            } else {
+              alreadyIncludeNull = true
+            }
           } else {
-            assignLong(array, i, resultArray, pos)
+            // check elem is already stored in arrayBuffer or not?
+            var j = 0
+            while (!found && j < arrayBuffer.size) {
+              val va = arrayBuffer(j)
+              if (va != null && ordering.equiv(va, elem)) {
+                found = true
+              }
+              j = j + 1
+            }
           }
-          if (assigned) {
-            pos += 1
+          if (!found) {
+            if (arrayBuffer.length > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+              ArraySetLike.throwUnionLengthOverflowException(arrayBuffer.length)
+            }
+            arrayBuffer += elem
          }
-        }
-        i += 1
-      }
+        }))
+        new GenericArrayData(arrayBuffer)
    }
-    pos
   }
 
   override def nullSafeEval(input1: Any, input2: 

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-05 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r207767226
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3965,6 +4034,242 @@ object ArrayUnion {
   }
 }
 
+/**
+ * Returns an array of the elements in the intersect of x and y, without duplicates
+ */
+@ExpressionDescription(
+  usage = """
+  _FUNC_(array1, array2) - Returns an array of the elements in the intersection of array1 and
+    array2, without duplicates.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+       array(1, 3)
+  """,
+  since = "2.4.0")
+case class ArrayIntersect(left: Expression, right: Expression) extends ArraySetLike
+  with ComplexTypeMergingExpression {
+  override def dataType: DataType = {
+    dataTypeCheck
+    ArrayType(elementType,
+      left.dataType.asInstanceOf[ArrayType].containsNull &&
+        right.dataType.asInstanceOf[ArrayType].containsNull)
+  }
+
+  @transient lazy val evalIntersect: (ArrayData, ArrayData) => ArrayData = {
+    if (elementTypeSupportEquals) {
+      (array1, array2) =>
+        val hs = new OpenHashSet[Any]
+        val hsResult = new OpenHashSet[Any]
+        var foundNullElement = false
+        var i = 0
+        while (i < array2.numElements()) {
+          if (array2.isNullAt(i)) {
+            foundNullElement = true
+          } else {
+            val elem = array2.get(i, elementType)
+            hs.add(elem)
+          }
+          i += 1
+        }
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        i = 0
+        while (i < array1.numElements()) {
+          if (array1.isNullAt(i)) {
+            if (foundNullElement) {
+              arrayBuffer += null
+              foundNullElement = false
+            }
+          } else {
+            val elem = array1.get(i, elementType)
+            if (hs.contains(elem) && !hsResult.contains(elem)) {
+              arrayBuffer += elem
+              hsResult.add(elem)
+            }
+          }
+          i += 1
+        }
+        new GenericArrayData(arrayBuffer)
+    } else {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        var alreadySeenNull = false
+        var i = 0
+        while (i < array1.numElements()) {
+          var found = false
+          val elem1 = array1.get(i, elementType)
+          if (array1.isNullAt(i)) {
+            if (!alreadySeenNull) {
+              var j = 0
+              while (!found && j < array2.numElements()) {
+                found = array2.isNullAt(j)
+                j += 1
+              }
+              // array2 is scanned only once for null element
+              alreadySeenNull = true
+            }
+          } else {
+            var j = 0
+            while (!found && j < array2.numElements()) {
+              if (!array2.isNullAt(j)) {
+                val elem2 = array2.get(j, elementType)
+                if (ordering.equiv(elem1, elem2)) {
+                  // check whether elem1 is already stored in arrayBuffer
+                  var foundArrayBuffer = false
+                  var k = 0
+                  while (!foundArrayBuffer && k < arrayBuffer.size) {
+                    val va = arrayBuffer(k)
+                    foundArrayBuffer = (va != null) && ordering.equiv(va, elem1)
+                    k += 1
+                  }
+                  found = !foundArrayBuffer
+                }
+              }
+              j += 1
+            }
+          }
+          if (found) {
+            arrayBuffer += elem1
+          }
+          i += 1
+        }
+        new GenericArrayData(arrayBuffer)
+    }
+  }
+
+  override def nullSafeEval(input1: Any, input2: Any): Any = {
+    val array1 = input1.asInstanceOf[ArrayData]
+    val array2 = input2.asInstanceOf[ArrayData]
+
+    evalIntersect(array1, array2)
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val arrayData = classOf[ArrayData].getName
+    val i = ctx.freshName("i")
+    val value = ctx.freshName("value")
+    val size = ctx.freshName("size")
+    if (canUseSpecializedHashSet) {
+      val jt = CodeGenerator.javaType(elementType)
+      val ptName = CodeGenerator.primitiveTypeName(jt)
+
+      nullSafeCodeGen(ctx, ev, (array1, array2) => {
+        val foundNullElement = 

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94257/
Test PASSed.


---




[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #94257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94257/testReport)** for PR 21305 at commit [`618a79d`](https://github.com/apache/spark/commit/618a79dfebf52710e3d86abbde35c65910b91a81).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94255/
Test PASSed.


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22004
  
**[Test build #94255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94255/testReport)** for PR 22004 at commit [`626e7bd`](https://github.com/apache/spark/commit/626e7bd16769ee6dc42d7d04df10981e719c530d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MyData(val i: Int) extends Serializable`


---




[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-05 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r207766511
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3965,6 +4034,242 @@ object ArrayUnion {
   }
 }
 
+/**
+ * Returns an array of the elements in the intersect of x and y, without duplicates
+ */
+@ExpressionDescription(
+  usage = """
+  _FUNC_(array1, array2) - Returns an array of the elements in the intersection of array1 and
+    array2, without duplicates.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+       array(1, 3)
+  """,
+  since = "2.4.0")
+case class ArrayIntersect(left: Expression, right: Expression) extends ArraySetLike
+  with ComplexTypeMergingExpression {
+  override def dataType: DataType = {
+    dataTypeCheck
+    ArrayType(elementType,
+      left.dataType.asInstanceOf[ArrayType].containsNull &&
+        right.dataType.asInstanceOf[ArrayType].containsNull)
+  }
+
+  @transient lazy val evalIntersect: (ArrayData, ArrayData) => ArrayData = {
+    if (elementTypeSupportEquals) {
+      (array1, array2) =>
+        val hs = new OpenHashSet[Any]
+        val hsResult = new OpenHashSet[Any]
+        var foundNullElement = false
+        var i = 0
+        while (i < array2.numElements()) {
+          if (array2.isNullAt(i)) {
+            foundNullElement = true
+          } else {
+            val elem = array2.get(i, elementType)
+            hs.add(elem)
+          }
+          i += 1
+        }
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        i = 0
+        while (i < array1.numElements()) {
+          if (array1.isNullAt(i)) {
+            if (foundNullElement) {
+              arrayBuffer += null
+              foundNullElement = false
+            }
+          } else {
+            val elem = array1.get(i, elementType)
+            if (hs.contains(elem) && !hsResult.contains(elem)) {
+              arrayBuffer += elem
+              hsResult.add(elem)
+            }
+          }
+          i += 1
+        }
+        new GenericArrayData(arrayBuffer)
+    } else {
+      (array1, array2) =>
+        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+        var alreadySeenNull = false
+        var i = 0
+        while (i < array1.numElements()) {
+          var found = false
+          val elem1 = array1.get(i, elementType)
+          if (array1.isNullAt(i)) {
+            if (!alreadySeenNull) {
+              var j = 0
+              while (!found && j < array2.numElements()) {
+                found = array2.isNullAt(j)
+                j += 1
+              }
+              // array2 is scanned only once for null element
+              alreadySeenNull = true
+            }
+          } else {
+            var j = 0
+            while (!found && j < array2.numElements()) {
+              if (!array2.isNullAt(j)) {
+                val elem2 = array2.get(j, elementType)
+                if (ordering.equiv(elem1, elem2)) {
+                  // check whether elem1 is already stored in arrayBuffer
+                  var foundArrayBuffer = false
+                  var k = 0
+                  while (!foundArrayBuffer && k < arrayBuffer.size) {
+                    val va = arrayBuffer(k)
+                    foundArrayBuffer = (va != null) && ordering.equiv(va, elem1)
+                    k += 1
+                  }
+                  found = !foundArrayBuffer
+                }
+              }
+              j += 1
+            }
+          }
+          if (found) {
+            arrayBuffer += elem1
+          }
+          i += 1
+        }
+        new GenericArrayData(arrayBuffer)
+    }
+  }
+
+  override def nullSafeEval(input1: Any, input2: Any): Any = {
+    val array1 = input1.asInstanceOf[ArrayData]
+    val array2 = input2.asInstanceOf[ArrayData]
+
+    evalIntersect(array1, array2)
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val arrayData = classOf[ArrayData].getName
+    val i = ctx.freshName("i")
+    val value = ctx.freshName("value")
+    val size = ctx.freshName("size")
+    if (canUseSpecializedHashSet) {
+      val jt = CodeGenerator.javaType(elementType)
+      val ptName = CodeGenerator.primitiveTypeName(jt)
+
+      nullSafeCodeGen(ctx, ev, (array1, array2) => {
+        val foundNullElement = 

[GitHub] spark issue #21948: [SPARK-24991][SQL] use InternalRow in DataSourceWriter

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21948
  
**[Test build #94263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94263/testReport)** for PR 21948 at commit [`86817c7`](https://github.com/apache/spark/commit/86817c7ee36f1344e977bb5af14aeb56232c17d5).


---




[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-05 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r207765490
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3965,6 +4034,242 @@ object ArrayUnion {
   }
 }
 
+/**
+ * Returns an array of the elements in the intersect of x and y, without duplicates
+ */
+@ExpressionDescription(
+  usage = """
+  _FUNC_(array1, array2) - Returns an array of the elements in the intersection of array1 and
+    array2, without duplicates.
+  """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), array(1, 3, 5));
+       array(1, 3)
+  """,
+  since = "2.4.0")
+case class ArrayIntersect(left: Expression, right: Expression) extends ArraySetLike
+  with ComplexTypeMergingExpression {
+  override def dataType: DataType = {
+    dataTypeCheck
+    ArrayType(elementType,
+      left.dataType.asInstanceOf[ArrayType].containsNull &&
+        right.dataType.asInstanceOf[ArrayType].containsNull)
+  }
+
+  @transient lazy val evalIntersect: (ArrayData, ArrayData) => ArrayData = {
+    if (elementTypeSupportEquals) {
+      (array1, array2) =>
+        val hs = new OpenHashSet[Any]
--- End diff --

How about shortcutting to return an empty array when we find one of the two 
is empty?
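
A minimal sketch of that shortcut (illustrative only, reusing the `ArrayData`, `GenericArrayData`, and `evalIntersect` names from the diff above; this is not code from the PR):

```
// If either side is empty, the intersection is empty: skip building hash sets.
def intersectWithShortcut(array1: ArrayData, array2: ArrayData): ArrayData = {
  if (array1.numElements() == 0 || array2.numElements() == 0) {
    new GenericArrayData(Array.empty[Any])
  } else {
    evalIntersect(array1, array2)  // fall back to the full scan
  }
}
```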


---




[GitHub] spark issue #21948: [SPARK-24991][SQL] use InternalRow in DataSourceWriter

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21948
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1830/
Test PASSed.


---




[GitHub] spark issue #21948: [SPARK-24991][SQL] use InternalRow in DataSourceWriter

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21948
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21948: [SPARK-24991][SQL] use InternalRow in DataSourceWriter

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21948
  
retest this please


---




[GitHub] spark pull request #21948: [SPARK-24991][SQL] use InternalRow in DataSourceW...

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21948#discussion_r207765500
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MemorySinkV2Suite.scala ---
@@ -44,16 +46,16 @@ class MemorySinkV2Suite extends StreamTest with BeforeAndAfter {
     val writer = new MemoryStreamWriter(sink, OutputMode.Append(), new StructType().add("i", "int"))
     writer.commit(0,
       Array(
-        MemoryWriterCommitMessage(0, Seq(InternalRow(1), InternalRow(2))),
-        MemoryWriterCommitMessage(1, Seq(InternalRow(3), InternalRow(4))),
-        MemoryWriterCommitMessage(2, Seq(InternalRow(6), InternalRow(7)))
+        MemoryWriterCommitMessage(0, Seq(Row(1), Row(2))),
--- End diff --

because the memory sink needs `Row`s at the end. Instead of collecting `InternalRow`s via copy and then converting them to `Row`s, I think it's more efficient to collect `Row`s directly.


---




[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22003
  
Hi, @yhuai. Could you review this PR? The following is the new pom file for `spark-sql_2.11`.
```
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.5.2</version>
  <classifier>nohive</classifier>
  <scope>compile</scope>
  <exclusions>
    <exclusion>
      <artifactId>hadoop-common</artifactId>
      <groupId>org.apache.hadoop</groupId>
    </exclusion>
    <exclusion>
      <artifactId>hadoop-hdfs</artifactId>
      <groupId>org.apache.hadoop</groupId>
    </exclusion>
    <exclusion>
      <artifactId>hive-storage-api</artifactId>
      <groupId>org.apache.hive</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-mapreduce</artifactId>
  <version>1.5.2</version>
  <classifier>nohive</classifier>
  <scope>compile</scope>
  <exclusions>
    <exclusion>
      <artifactId>hadoop-common</artifactId>
      <groupId>org.apache.hadoop</groupId>
    </exclusion>
    <exclusion>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <groupId>org.apache.hadoop</groupId>
    </exclusion>
    <exclusion>
      <artifactId>orc-core</artifactId>
      <groupId>org.apache.orc</groupId>
    </exclusion>
    <exclusion>
      <artifactId>hive-storage-api</artifactId>
      <groupId>org.apache.hive</groupId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>
</dependency>
```


---




[GitHub] spark pull request #21996: [SPARK-24888][CORE] spark-submit --master spark:/...

2018-08-05 Thread devaraj-kavali
Github user devaraj-kavali commented on a diff in the pull request:

https://github.com/apache/spark/pull/21996#discussion_r207763630
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -98,17 +98,24 @@ private[spark] class SparkSubmit extends Logging {
    * Kill an existing submission using the REST protocol. Standalone and Mesos cluster mode only.
    */
   private def kill(args: SparkSubmitArguments): Unit = {
-    new RestSubmissionClient(args.master)
-      .killSubmission(args.submissionToKill)
+    createRestSubmissionClient(args).killSubmission(args.submissionToKill)
   }
 
   /**
    * Request the status of an existing submission using the REST protocol.
    * Standalone and Mesos cluster mode only.
    */
   private def requestStatus(args: SparkSubmitArguments): Unit = {
-    new RestSubmissionClient(args.master)
-      .requestSubmissionStatus(args.submissionToRequestStatusFor)
+    createRestSubmissionClient(args).requestSubmissionStatus(args.submissionToRequestStatusFor)
+  }
+
+  /**
+   * Creates RestSubmissionClient with overridden logInfo()
+   */
+  private def createRestSubmissionClient(args: SparkSubmitArguments): RestSubmissionClient = {
+    new RestSubmissionClient(args.master) {
+      override protected def logInfo(msg: => String): Unit = printMessage(msg)
--- End diff --

I agree, the user can change this log level. If users configure the log level as WARN or above (WARN is the default log level), then they can't see any update/status output from the status and kill commands. I don't think we can expect users to set the log level to INFO just to see the output of the status and kill commands. Please let me know if you have any thoughts on a better fix; I can make the changes. Thanks


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22004
  
**[Test build #94262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94262/testReport)** for PR 22004 at commit [`422c4ab`](https://github.com/apache/spark/commit/422c4ab259b5e27ef12c2d5093a4ae93f2b7f522).


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1829/
Test PASSed.


---




[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22003
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94254/
Test PASSed.


---




[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22003
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22003
  
**[Test build #94254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94254/testReport)** for PR 22003 at commit [`a801498`](https://github.com/apache/spark/commit/a801498f249d7526b64fcb9fe8144325ebb3d4e4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207763075
  
--- Diff: repl/src/test/scala/org/apache/spark/repl/ReplSuite.scala ---
@@ -84,6 +85,7 @@ class ReplSuite extends SparkFunSuite {
   settings = new scala.tools.nsc.Settings
   settings.usejavacp.value = true
   org.apache.spark.repl.Main.interp = this
+  in = SimpleReader()
--- End diff --

This was giving an NPE in 2.12. I think this little hack fixes the issue for the purposes of this test.


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207759811
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -169,25 +181,50 @@ package object expressions  {
         })
       }
 
-      // Find matches for the given name assuming that the 1st part is a qualifier (i.e. table name,
-      // alias, or subquery alias) and the 2nd part is the actual name. This returns a tuple of
+      // Find matches for the given name assuming that the 1st two parts are qualifier
+      // (i.e. database name and table name) and the 3rd part is the actual column name.
+      //
+      // For example, consider an example where "db1" is the database name, "a" is the table name
+      // and "b" is the column name and "c" is the struct field name.
+      // If the name parts is db1.a.b.c, then Attribute will match
--- End diff --

What I'm talking about is ambiguity. `col.innerField1.innerField2` can fail if `innerField2` doesn't exist. My question is: shall we try all the possible resolution paths and pick the valid one, or define a rule that decides the resolution path ahead of time?
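
To make the two candidate resolution paths concrete, a hypothetical illustration (the schema is invented for this example and assumes a table `db1.a` with a struct column `b` that has a field `c`):

```
// The name parts of db1.a.b.c admit two resolution paths:
//   (1) database db1, table a, column b, struct field c
//   (2) table db1, column a, struct fields b and c (fails if field c is absent)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
spark.sql("SELECT db1.a.b.c FROM db1.a")
```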


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21984
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21984
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94260/
Test PASSed.


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21984
  
**[Test build #94260 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94260/testReport)** for PR 21984 at commit [`c6b997b`](https://github.com/apache/spark/commit/c6b997b673790870b445f0c1e16941fa244d5dea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21622: [SPARK-24637][SS] Add metrics regarding state and waterm...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21622
  
**[Test build #94261 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94261/testReport)** for PR 21622 at commit [`722e6a0`](https://github.com/apache/spark/commit/722e6a0f7506440f260126d841d0cb27cf744100).


---




[GitHub] spark issue #21956: [MINOR][DOCS] Fix grammatical error in SortShuffleManage...

2018-08-05 Thread deshanxiao
Github user deshanxiao commented on the issue:

https://github.com/apache/spark/pull/21956
  
It's ok. Thank you for your suggestions!


---




[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-08-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21102#discussion_r207758427
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -1647,6 +1647,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
     assert(result10.first.schema(0).dataType === expectedType10)
   }
 
+  test("array_intersect functions") {
+    val df1 = Seq((Array(1, 2, 4), Array(4, 2))).toDF("a", "b")
+    val ans1 = Row(Seq(2, 4))
+    checkAnswer(df1.select(array_intersect($"a", $"b")), ans1)
+    checkAnswer(df1.selectExpr("array_intersect(a, b)"), ans1)
+
+    val df2 = Seq((Array[Integer](1, 2, null, 4, 5), Array[Integer](-5, 4, null, 2, -1)))
+      .toDF("a", "b")
+    val ans2 = Row(Seq(2, null, 4))
+    checkAnswer(df2.select(array_intersect($"a", $"b")), ans2)
+    checkAnswer(df2.selectExpr("array_intersect(a, b)"), ans2)
+
+    val df3 = Seq((Array(1L, 2L, 4L), Array(4L, 2L))).toDF("a", "b")
+    val ans3 = Row(Seq(2L, 4L))
+    checkAnswer(df3.select(array_intersect($"a", $"b")), ans3)
+    checkAnswer(df3.selectExpr("array_intersect(a, b)"), ans3)
+
+    val df4 = Seq(
+      (Array[java.lang.Long](1L, 2L, null, 4L, 5L), Array[java.lang.Long](-5L, 4L, null, 2L, -1L)))
+      .toDF("a", "b")
+    val ans4 = Row(Seq(2L, null, 4L))
+    checkAnswer(df4.select(array_intersect($"a", $"b")), ans4)
+    checkAnswer(df4.selectExpr("array_intersect(a, b)"), ans4)
+
+    val df5 = Seq((Array("c", null, "a", "f"), Array("b", "a", null, "g"))).toDF("a", "b")
+    val ans5 = Row(Seq(null, "a"))
+    checkAnswer(df5.select(array_intersect($"a", $"b")), ans5)
+    checkAnswer(df5.selectExpr("array_intersect(a, b)"), ans5)
+
+    val df6 = Seq((null, null)).toDF("a", "b")
+    intercept[AnalysisException] {
--- End diff --

Could you also check the error message?
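
For example, something like the sketch below; the exact message text here is a guess and would need to match the analyzer's actual output:

```
val e = intercept[AnalysisException] {
  df6.select(array_intersect($"a", $"b"))
}
// Assert on the message, not just the exception type.
assert(e.getMessage.contains("data type mismatch"))
```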


---




[GitHub] spark issue #21622: [SPARK-24637][SS] Add metrics regarding state and waterm...

2018-08-05 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21622
  
retest this please


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21984
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1828/
Test PASSed.


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21984
  
**[Test build #94260 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94260/testReport)** for PR 21984 at commit [`c6b997b`](https://github.com/apache/spark/commit/c6b997b673790870b445f0c1e16941fa244d5dea).


---




[GitHub] spark issue #21984: [SPARK-24772][SQL] Avro: support logical date type

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21984
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r207758158
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -566,6 +579,9 @@ private[spark] class TaskSchedulerImpl(
     if (taskResultGetter != null) {
       taskResultGetter.stop()
     }
+    if (barrierCoordinator != null) {
--- End diff --

maybe we should not use `lazy val`, but use `var` and control the 
initialization ourselves.
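
A self-contained sketch of that pattern (plain Scala with hypothetical names, not this PR's actual types):

```
// A var plus explicit init/stop control instead of a lazy val: stop() can
// tell whether the coordinator was ever created before touching it.
class CoordinatorHolder(create: () => AutoCloseable) {
  private var coordinator: AutoCloseable = null

  def maybeInit(): Unit = synchronized {
    if (coordinator == null) {
      coordinator = create()  // created on first use, not at field init
    }
  }

  def stop(): Unit = synchronized {
    if (coordinator != null) {
      coordinator.close()
      coordinator = null
    }
  }
}
```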


---




[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r207758062
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1930,6 +1930,12 @@ class SparkContext(config: SparkConf) extends Logging {
     Utils.tryLogNonFatalError {
       _executorAllocationManager.foreach(_.stop())
     }
+    if (_dagScheduler != null) {
--- End diff --

why this change?


---




[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r207758063
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala ---
@@ -80,7 +101,45 @@ class BarrierTaskContext(
   @Experimental
   @Since("2.4.0")
   def barrier(): Unit = {
-    // TODO SPARK-24817 implement global barrier.
+    val callSite = Utils.getCallSite()
+    logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) has entered " +
+      s"the global sync, current barrier epoch is $barrierEpoch.")
+    logTrace(s"Current callSite: $callSite")
--- End diff --

or simpler: `logTrace("Current callSite: " + Utils.getCallSite())`


---




[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r207758008
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala ---
@@ -80,7 +101,45 @@ class BarrierTaskContext(
   @Experimental
   @Since("2.4.0")
   def barrier(): Unit = {
-    // TODO SPARK-24817 implement global barrier.
+    val callSite = Utils.getCallSite()
+    logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) has entered " +
+      s"the global sync, current barrier epoch is $barrierEpoch.")
+    logTrace(s"Current callSite: $callSite")
--- End diff --

+1


---




[GitHub] spark pull request #21898: [SPARK-24817][Core] Implement BarrierTaskContext....

2018-08-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21898#discussion_r207757953
  
--- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala ---
@@ -39,6 +44,22 @@ class BarrierTaskContext(
   extends TaskContextImpl(stageId, stageAttemptNumber, partitionId, taskAttemptId, attemptNumber,
     taskMemoryManager, localProperties, metricsSystem, taskMetrics) {
 
+  // Find the driver side RPCEndpointRef of the coordinator that handles all the barrier() calls.
+  private val barrierCoordinator: RpcEndpointRef = {
+    val env = SparkEnv.get
+    RpcUtils.makeDriverRef("barrierSync", env.conf, env.rpcEnv)
+  }
+
+  private val timer = new Timer("Barrier task timer for barrier() calls.")
+
+  // Local barrierEpoch that identify a barrier() call from current task, it shall be identical
+  // with the driver side epoch.
+  private var barrierEpoch = 0
+
+  // Number of tasks of the current barrier stage, a barrier() call must collect enough requests
+  // from different tasks within the same barrier stage attempt to succeed.
+  private lazy val numTasks = getTaskInfos().size
--- End diff --

this can be a `def`.
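
i.e., something like:

```
// Recomputed on each use; no cached field and no initialization-order concerns.
private def numTasks: Int = getTaskInfos().size
```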


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Jenkins build successful. Any PR comments/blockers to merge for phase 1?

cc @HyukjinKwon, @gatorsmile, @cloud-fan 


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21608
  
**[Test build #94259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94259/testReport)** for PR 21608 at commit [`253af70`](https://github.com/apache/spark/commit/253af70cd4cd525db4951db67fce01a5ca1f0014).


---




[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-05 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21898
  
I see. Got it, thanks.


---




[GitHub] spark issue #21608: [SPARK-24626] [SQL] Improve location size calculation in...

2018-08-05 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21608
  
retest this please


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21889
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94252/
Test PASSed.


---




[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21998
  
**[Test build #94258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94258/testReport)** for PR 21998 at commit [`8f0c578`](https://github.com/apache/spark/commit/8f0c57804e75b2a74f7573a21f6c7c63f7b85e03).


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21889
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21889
  
**[Test build #94252 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94252/testReport)** for PR 21889 at commit [`8d7f4bc`](https://github.com/apache/spark/commit/8d7f4bc1874f8ae3c2cda8e5aa96a8647a56128d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21998: [SPARK-24940][SQL] Use IntegerLiteral in ResolveCoalesce...

2018-08-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21998
  
ok to test


---




[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-05 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21898
  
>  why 94247 is successful while 94241 was failed with the same set of test 
suites since they are tested using the same source revision.

They are not - I made the variable `timer` lazy for 94247.
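
For reference, the declaration quoted earlier in this thread, eager versus lazy:

```
import java.util.Timer

// Before (build 94241): created eagerly with every BarrierTaskContext.
// private val timer = new Timer("Barrier task timer for barrier() calls.")

// After (build 94247): created only if barrier() is actually called.
private lazy val timer = new Timer("Barrier task timer for barrier() calls.")
```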


---




[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-05 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21898
  
Looks like it finished without failure this time.
I am curious why 94247 is successful while 94241 failed with the same set of test suites, since they are tested using the same source revision.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94251/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21977
  
**[Test build #94251 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94251/testReport)**
 for PR 21977 at commit 
[`a5a788b`](https://github.com/apache/spark/commit/a5a788b3dbb3c6285f903fda740fe3f69ab275a9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207754816
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

Yeah, I was also wondering whether I had to implement this as well, but I feel people need to move to 2.12 with a different mindset, as things have changed. I'm not sure it is even possible, so I asked @LRytz in the JIRA.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22002: [FOLLOW-UP][SPARK-23772][SQL] Provide an option to ignor...

2018-08-05 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22002
  
LGTM cc: @HyukjinKwon 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22002: [FOLLOW-UP][SPARK-23772][SQL] Provide an option t...

2018-08-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22002#discussion_r207754359
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -267,7 +267,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat,
             timestampFormat=timestampFormat, multiLine=multiLine,
             allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep,
-            samplingRatio=samplingRatio, encoding=encoding)
+            samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding)
--- End diff --

oh... good catch. thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207753917
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

Yeah, you've said it more correctly: it's really the body of the lambda capturing references, and in this case the capture just happens to be of an enclosing class of the test method.

This is the kind of thing I wonder if we *do* have to handle, like in the old closure cleaner. Because LMF closures capture far less to begin with, it's much less of an issue. I also remember that closure-cleaning such things got dicey, because it's not clear when it's OK to null out some object's field. It may not be possible to do reasonably.
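A minimal, self-contained sketch of that difference, with hypothetical names standing in for the suite and the scalatest field (an illustration, not the actual test code):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CaptureDemo {
  class Helper // stand-in for scalatest's unserializable AssertionsHelper

  class Suite {
    val helper = new Helper

    // The body references a member of the enclosing instance, so the lambda
    // captures `this` (the whole Suite), dragging `helper` along with it.
    def capturing: Int => Int = x => x + helper.hashCode

    // The body uses only its argument and a local value, so the lambda
    // captures nothing from the enclosing Suite.
    def free: Int => Int = { val inc = 1; x => x + inc }
  }

  private def serializable(f: AnyRef): Boolean =
    try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f); true }
    catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val s = new Suite
    println(serializable(s.capturing)) // false: the captured Suite isn't serializable
    println(serializable(s.free))      // true: nothing problematic was captured
  }
}
```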


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207753682
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

If you move
```
import java.util.concurrent.atomic.AtomicBoolean

object FailThisAttempt {
  val _fail = new AtomicBoolean(true)
}
```
outside the tests as a top-level object, the tests pass; there is no need to move the functions to the companion object.

Btw, the closure cleaner does not look into the body of a lambda to check whether references to other objects create an issue - that is done only for the old-style closures. According to the documentation, we only check for `return` statements. Also, lambdas don't have outers by definition.

Regarding `LegacyAccumulatorWrapper`, there is no closure involved.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/21889
  
> Alright, to make sure we're all on the same page, it sounds like we're ready to merge this PR pending:
>
> * Successful build by Jenkins
> * Any PR comments from a maintainer
>
> This feature will be merged in a disabled state and can't be enabled until the next PR is merged, but we do not expect any regression in behavior in the default disabled state.

I agree.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207752883
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

Just that the task (the `rdd.map` call's argument) isn't serializable, for the same reason the `LegacyAccumulatorWrapper` case failed: it captures the test class, which has an unserializable `AssertionsHelper` field in a scalatest superclass. The problem here is capturing the enclosing test class to begin with, as it's not relevant to the task.
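As a hedged sketch of the failing shape and the companion-object fix (the names are hypothetical, not the actual suite code):

```scala
import org.apache.spark.SparkContext

class SomeSuite {
  private def transform(i: Int): Int = i * 2

  def runJob(sc: SparkContext): Long =
    // Fails with "Task not serializable": the lambda's body calls
    // `this.transform`, so `this` (the non-serializable suite) is captured.
    sc.parallelize(1 to 10).map(i => transform(i)).count()
}

object SomeSuite {
  // The same function on the companion object is effectively static, so
  // nothing from the suite instance ends up in the task closure.
  def transform(i: Int): Int = i * 2

  def runJob(sc: SparkContext): Long =
    sc.parallelize(1 to 10).map(transform).count()
}
```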


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1827/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207752807
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

Ok, I will have a look. Do you have the output of the failure? ScalaTest does not report much. Btw, what I noticed in these tests is that only the last one failed (the `failAfter` in "A job with one fetch failure should eventually succeed"), so I'm not sure whether it is the closure or something else (I need to debug it).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #94257 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94257/testReport)**
 for PR 21305 at commit 
[`618a79d`](https://github.com/apache/spark/commit/618a79dfebf52710e3d86abbde35c65910b91a81).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1826/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21305
  
@cloud-fan, I've added a suite for `DataType.canWrite`. I still need to add tests for the analyzer rule to make sure it catches any problems, and to validate `AppendData`'s `resolved` check. If you want to look at the `canWrite` suite in the meantime, I think it's ready for review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r207752632
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -336,4 +337,97 @@ object DataType {
       case (fromDataType, toDataType) => fromDataType == toDataType
     }
   }
+
+  /**
+   * Returns true if the write data type can be read using the read data type.
+   *
+   * The write type is compatible with the read type if:
+   * - Both types are arrays, the array element types are compatible, and element nullability is
+   *   compatible (read allows nulls or write does not contain nulls).
+   * - Both types are maps and the map key and value types are compatible, and value nullability
+   *   is compatible (read allows nulls or write does not contain nulls).
+   * - Both types are structs and each field in the read struct is present in the write struct and
+   *   compatible (including nullability), or is nullable if the write struct does not contain the
+   *   field. Write-side structs are not compatible if they contain fields that are not present in
+   *   the read-side struct.
+   * - Both types are atomic and the write type can be safely cast to the read type.
+   *
+   * Extra fields in write-side structs are not allowed to avoid accidentally writing data that
+   * the read schema will not read, and to ensure map key equality is not changed when data is read.
+   *
+   * @param write a write-side data type to validate against the read type
+   * @param read a read-side data type
+   * @return true if data written with the write type can be read using the read type
+   */
+  def canWrite(
+      write: DataType,
+      read: DataType,
+      resolver: Resolver,
+      context: String,
+      addError: String => Unit = (_: String) => {}): Boolean = {
+    (write, read) match {
+      case (wArr: ArrayType, rArr: ArrayType) =>
+        if (wArr.containsNull && !rArr.containsNull) {
+          addError(s"Cannot write nullable elements to array of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wArr.elementType, rArr.elementType, resolver, context + ".element", addError)
+        }
+
+      case (wMap: MapType, rMap: MapType) =>
+        // map keys cannot include data fields not in the read schema without changing equality
+        // when read. map keys can be missing fields as long as they are nullable in the read schema.
+        if (wMap.valueContainsNull && !rMap.valueContainsNull) {
+          addError(s"Cannot write nullable values to map of non-nulls: '$context'")
+          false
+        } else {
+          canWrite(wMap.keyType, rMap.keyType, resolver, context + ".key", addError) &&
+          canWrite(wMap.valueType, rMap.valueType, resolver, context + ".value", addError)
+        }
+
+      case (StructType(writeFields), StructType(readFields)) =>
+        lazy val extraFields = writeFields.map(_.name).toSet -- readFields.map(_.name)
+
+        var result = readFields.forall { readField =>
+          val fieldContext = context + "." + readField.name
+          writeFields.find(writeField => resolver(writeField.name, readField.name)) match {
--- End diff --

I've implemented this as described and added 
`DataTypeWriteCompatibilitySuite` to validate the behavior.
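For reference, a hypothetical usage sketch against the signature shown in the diff above (the demo harness is not part of the PR, and it assumes `caseSensitiveResolution` from `org.apache.spark.sql.catalyst.analysis`):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
import org.apache.spark.sql.types._

object CanWriteDemo {
  def main(args: Array[String]): Unit = {
    val errs = ArrayBuffer[String]()

    // Writing possibly-null elements where the read schema forbids nulls
    // should be rejected, per the doc comment in the diff.
    val write = ArrayType(IntegerType, containsNull = true)
    val read = ArrayType(IntegerType, containsNull = false)

    val ok = DataType.canWrite(write, read, caseSensitiveResolution, "t.col", errs += _)
    println(ok)           // false
    errs.foreach(println) // Cannot write nullable elements to array of non-nulls: 't.col'
  }
}
```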


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #94256 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94256/testReport)**
 for PR 21305 at commit 
[`ae0d242`](https://github.com/apache/spark/commit/ae0d2424315634760d46be0f21e0e98160bada5a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207752563
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2369,39 +2369,12 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC))
   }
 
-  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages" +
+  test("SPARK-17644: After one stage is aborted for too many failed attempts, subsequent stages " +
     "still behave correctly on fetch failures") {
-    // Runs a job that always encounters a fetch failure, so should eventually be aborted
--- End diff --

@skonto in answer to your question, here's an example of a method that runs a closure that seems to capture the enclosing test class and fails. I moved the definition of these methods out of the test method, but that didn't help; moving them to the companion object did. I'm not sure what is going on underneath, or whether you might have expected the closure cleaner to handle this case. I am not worried about it, just pointing out a slightly more complex example.

There are more failures in the mllib module, coming soon ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Alright, to make sure we're all on the same page, it sounds like we're ready to merge this PR pending:
* Successful build by Jenkins
* Any PR comments from a maintainer

This feature will be merged in a disabled state and can't be enabled until the next PR is merged, but we do not expect any regression in behavior in the default disabled state.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21913: [SPARK-24005][CORE] Remove usage of Scala’s parallel c...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21913
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94248/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21913: [SPARK-24005][CORE] Remove usage of Scala’s parallel c...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21913
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21913: [SPARK-24005][CORE] Remove usage of Scala’s parallel c...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21913
  
**[Test build #94248 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94248/testReport)**
 for PR 21913 at commit 
[`610154c`](https://github.com/apache/spark/commit/610154c7a127628d6bb4d30c790d27db474fbf65).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: Task...

2018-08-05 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22004#discussion_r207752019
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2583,4 +2556,32 @@ object DAGSchedulerSuite {
 
   def makeBlockManagerId(host: String): BlockManagerId =
     BlockManagerId("exec-" + host, host, 12345)
+
+  // Runs a job that always encounters a fetch failure, so should eventually be aborted
+  def runJobWithPersistentFetchFailure(sc: SparkContext): Unit = {
--- End diff --

:+1: How was this affecting the tests (just curious)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1825/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22004: [WIP][SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSeri...

2018-08-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22004
  
**[Test build #94255 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94255/testReport)**
 for PR 22004 at commit 
[`626e7bd`](https://github.com/apache/spark/commit/626e7bd16769ee6dc42d7d04df10981e719c530d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22004: [SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSe...

2018-08-05 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/22004

[SPARK-25029][TESTS] Scala 2.12 issues: TaskNotSerializable and Janino "Two 
non-abstract methods ..." errors

## What changes were proposed in this pull request?

Fixes for test issues that arose after Scala 2.12 support was added -- ones 
that only affect the 2.12 build.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-25029

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22004


commit 626e7bd16769ee6dc42d7d04df10981e719c530d
Author: Sean Owen 
Date:   2018-08-05T23:01:07Z

Initial fixes for 2.12 test issues




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


