[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19820
  
Would it be possible to add which tests caused this exception in the 
description?


---




[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19820
  
LGTM except one comment.


---




[GitHub] spark pull request #19814: [SPARK-22484][DOC] Document PySpark DataFrame csv...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19814#discussion_r153062402
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -828,8 +828,7 @@ def csv(self, path, mode=None, compression=None, 
sep=None, quote=None, escape=No
 set, it uses the default value, ``,``.
 :param quote: sets the single character used for escaping quoted 
values where the
   separator can be part of the value. If None is set, 
it uses the default
-  value, ``"``. If you would like to turn off 
quotations, you need to set an
-  empty string.
+  value, ``"``. If an empty string is set, it uses ``u0000``.
--- End diff --

If there are doc changes to be done for options here, let's make sure 
to change them all. A quick and easy way would be just to `grep` and replace.
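
For reference, a minimal sketch of the option under discussion, using the Scala side of the same API (a local SparkSession, the data, and the output paths are made up for illustration):

```Scala
// Hedged sketch: the `quote` option on DataFrameWriter. Per the doc change
// above, an empty quote string is treated as \u0000, effectively turning
// quoting off. Paths below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-quote").getOrCreate()
import spark.implicits._

val df = Seq(("contains,comma", 1)).toDF("key", "value")

df.write.option("quote", "\"").csv("/tmp/csv-quoted")   // default behavior
df.write.option("quote", "").csv("/tmp/csv-unquoted")   // quoting turned off
```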


---




[GitHub] spark issue #19814: [SPARK-22484][DOC] Document PySpark DataFrame csv writer...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19814
  
Can we turn it off as documented? We could try to open a JIRA in Univocity 
if this functionality is not there and incorporate the change in Spark.


---




[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

2017-11-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19815
  
LGTM except one comment


---




[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

2017-11-25 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19815#discussion_r153061379
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int 
precision) {
   @Override
   public UTF8String getUTF8String(int rowId) {
--- End diff --

+1


---




[GitHub] spark pull request #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit pr...

2017-11-25 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19817#discussion_r153060800
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -1372,19 +1372,30 @@ case class FormatString(children: Expression*) 
extends Expression with ImplicitC
 val pattern = children.head.genCode(ctx)
 
 val argListGen = children.tail.map(x => (x.dataType, x.genCode(ctx)))
-val argListCode = argListGen.map(_._2.code + "\n")
-
-val argListString = argListGen.foldLeft("")((s, v) => {
-  val nullSafeString =
+val argList = ctx.freshName("argLists")
+val numArgLists = argListGen.length
+val argListCode = argListGen.zipWithIndex.map { case(v, index) =>
+  val value =
 if (ctx.boxedType(v._1) != ctx.javaType(v._1)) {
   // Java primitives get boxed in order to allow null values.
   s"(${v._2.isNull}) ? (${ctx.boxedType(v._1)}) null : " +
 s"new ${ctx.boxedType(v._1)}(${v._2.value})"
 } else {
   s"(${v._2.isNull}) ? null : ${v._2.value}"
 }
-  s + "," + nullSafeString
-})
+  s"""
+ ${v._2.code}
+ $argList[$index] = $value;
+   """
+}
+val argListCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == 
null) {
+  ctx.splitExpressions(
+expressions = argListCode,
+funcName = "valueFormatString",
+arguments = ("InternalRow", ctx.INPUT_ROW) :: ("Object[]", 
argList) :: Nil)
+} else {
+  argListCode.mkString("\n")
+}
--- End diff --

Sure


---




[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

2017-11-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18692#discussion_r153060560
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that eliminates CROSS joins by inferring join conditions from 
propagated constraints.
+ *
+ * The optimization is applicable only to CROSS joins. For other join 
types, adding inferred join
+ * conditions would potentially shuffle children as child node's 
partitioning won't satisfy the JOIN
+ * node's requirements which otherwise could have.
+ *
+ * For instance, if there is a CROSS join, where the left relation has 'a 
= 1' and the right
+ * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
+ */
+object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (SQLConf.get.constraintPropagationEnabled) {
+  eliminateCrossJoin(plan)
+} else {
+  plan
+}
+  }
+
+  private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan 
transform {
+case join@Join(leftPlan, rightPlan, Cross, None) =>
--- End diff --

Nit: `join@Join` -> `join @ Join`


---




[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

2017-11-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18692#discussion_r153060551
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that eliminates CROSS joins by inferring join conditions from 
propagated constraints.
+ *
+ * The optimization is applicable only to CROSS joins. For other join 
types, adding inferred join
+ * conditions would potentially shuffle children as child node's 
partitioning won't satisfy the JOIN
+ * node's requirements which otherwise could have.
+ *
+ * For instance, if there is a CROSS join, where the left relation has 'a 
= 1' and the right
+ * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
--- End diff --

> For instance, given a CROSS join with the constraint 'a = 1' from the 
left child and the constraint 'b = 1' from the right child, this rule infers a 
new join predicate 'a = b' and converts it to an Inner join.
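
A hedged illustration of that rewrite with the DataFrame API (assumes a SparkSession with implicits in scope; table shapes are made up):

```Scala
// Illustration only: constraints 'a = 1' and 'b = 1' on the two sides of a
// cross join let the optimizer infer 'a = b' and plan an inner join.
val t1 = spark.range(10).toDF("a")
val t2 = spark.range(10).toDF("b")

val crossed = t1.filter($"a" === 1).crossJoin(t2.filter($"b" === 1))

// Conceptually rewritten by EliminateCrossJoin to:
// t1.filter($"a" === 1).join(t2.filter($"b" === 1), $"a" === $"b")
```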


---




[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

2017-11-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18692#discussion_r153060595
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that eliminates CROSS joins by inferring join conditions from 
propagated constraints.
+ *
+ * The optimization is applicable only to CROSS joins. For other join 
types, adding inferred join
+ * conditions would potentially shuffle children as child node's 
partitioning won't satisfy the JOIN
+ * node's requirements which otherwise could have.
+ *
+ * For instance, if there is a CROSS join, where the left relation has 'a 
= 1' and the right
+ * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
+ */
+object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (SQLConf.get.constraintPropagationEnabled) {
+  eliminateCrossJoin(plan)
+} else {
+  plan
+}
+  }
+
+  private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan 
transform {
+case join@Join(leftPlan, rightPlan, Cross, None) =>
+  val leftConstraints = 
join.constraints.filter(_.references.subsetOf(leftPlan.outputSet))
+  val rightConstraints = 
join.constraints.filter(_.references.subsetOf(rightPlan.outputSet))
+  val inferredJoinPredicates = inferJoinPredicates(leftConstraints, 
rightConstraints)
+  val joinConditionOpt = inferredJoinPredicates.reduceOption(And)
+  if (joinConditionOpt.isDefined) Join(leftPlan, rightPlan, Inner, 
joinConditionOpt) else join
+  }
+
+  private def inferJoinPredicates(
+  leftConstraints: Set[Expression],
+  rightConstraints: Set[Expression]): Set[EqualTo] = {
+
+// iterate through the left constraints and build a hash map that 
points semantically
+// equivalent expressions into attributes
+val emptyEquivalenceMap = Map.empty[SemanticExpression, Set[Attribute]]
+val equivalenceMap = leftConstraints.foldLeft(emptyEquivalenceMap) { 
case (map, constraint) =>
+  constraint match {
+case EqualTo(attr: Attribute, expr: Expression) =>
+  updateEquivalenceMap(map, attr, expr)
+case EqualTo(expr: Expression, attr: Attribute) =>
+  updateEquivalenceMap(map, attr, expr)
+case _ => map
+  }
+}
+
+// iterate through the right constraints and infer join conditions 
using the equivalence map
+rightConstraints.foldLeft(Set.empty[EqualTo]) { case (joinConditions, 
constraint) =>
+  constraint match {
+case EqualTo(attr: Attribute, expr: Expression) =>
+  appendJoinConditions(attr, expr, equivalenceMap, joinConditions)
+case EqualTo(expr: Expression, attr: Attribute) =>
+  appendJoinConditions(attr, expr, equivalenceMap, joinConditions)
+case _ => joinConditions
+  }
+}
+  }
+
+  private def updateEquivalenceMap(
+  equivalenceMap: Map[SemanticExpression, Set[Attribute]],
+  attr: Attribute,
+  expr: Expression): Map[SemanticExpression, Set[Attribute]] = {
+
+val equivalentAttrs = equivalenceMap.getOrElse(expr, 
Set.empty[Attribute])
+if (equivalentAttrs.contains(attr)) {
+  equivalenceMap
+} else {
+  equivalenceMap.updated(expr, equivalentAttrs + attr)
+}
+  }
+
+  private def appendJoinConditions(
+  attr: Attribute,
+  expr: Expression,
+  equivalenceMap: Map[SemanticExpression, Set[Attribute]],
+  joinConditions: Set[EqualTo]): Set[EqualTo] = {
+
+equivalenceMap.get(expr) match {
+  case Some(equivalentAttrs) => joinConditions ++ 
equivalentAttrs.map(EqualTo(attr, _))
+  case None => joinConditions
+}
+  }
+
+  // the purpose of this class is to treat 'a === 1 and 1 === 'a as the 
same expressions
+  implicit class SemanticExpression(private val expr: Expression) {
--- End diff --

Can we reuse `EquivalentExpressions`? You can search the code base and see 
how the others use it.
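
For context, a minimal sketch of the `EquivalentExpressions` usage pattern (the expressions are placeholders built for the example; the real call sites are in the code base, as suggested):

```Scala
// Hedged sketch: EquivalentExpressions groups expressions by semantic
// equality, so 'a === 1 and 1 === 'a should land in one equivalence class.
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, EquivalentExpressions, Literal}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val e1 = EqualTo(a, Literal(1))
val e2 = EqualTo(Literal(1), a)  // semantically the same as e1

val equivalents = new EquivalentExpressions
equivalents.addExpr(e1)                // false: first occurrence
val matched = equivalents.addExpr(e2)  // true if e2 canonicalizes to e1
```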


---




[GitHub] spark pull request #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit pr...

2017-11-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19817#discussion_r153060489
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -1372,19 +1372,30 @@ case class FormatString(children: Expression*) 
extends Expression with ImplicitC
 val pattern = children.head.genCode(ctx)
 
 val argListGen = children.tail.map(x => (x.dataType, x.genCode(ctx)))
-val argListCode = argListGen.map(_._2.code + "\n")
-
-val argListString = argListGen.foldLeft("")((s, v) => {
-  val nullSafeString =
+val argList = ctx.freshName("argLists")
+val numArgLists = argListGen.length
+val argListCode = argListGen.zipWithIndex.map { case(v, index) =>
+  val value =
 if (ctx.boxedType(v._1) != ctx.javaType(v._1)) {
   // Java primitives get boxed in order to allow null values.
   s"(${v._2.isNull}) ? (${ctx.boxedType(v._1)}) null : " +
 s"new ${ctx.boxedType(v._1)}(${v._2.value})"
 } else {
   s"(${v._2.isNull}) ? null : ${v._2.value}"
 }
-  s + "," + nullSafeString
-})
+  s"""
+ ${v._2.code}
+ $argList[$index] = $value;
+   """
+}
+val argListCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == 
null) {
+  ctx.splitExpressions(
+expressions = argListCode,
+funcName = "valueFormatString",
+arguments = ("InternalRow", ctx.INPUT_ROW) :: ("Object[]", 
argList) :: Nil)
+} else {
+  argListCode.mkString("\n")
+}
--- End diff --

Could you create a `splitExpressions` in `CodegenContext` to avoid the 
duplicated code, like 

```Scala
if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
  ctx.splitExpressions(...)
} else {
  inputs.mkString("\n")
}
```
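
A minimal sketch of what such a helper could look like inside `CodegenContext` (the method name and parameters are assumptions, modeled on the existing `splitExpressions` overload used in the diff above):

```Scala
// Hedged sketch: split generated statements into methods only when reading
// from an InternalRow; under whole-stage codegen (currentVars != null) the
// inputs are local variables, so the code has to stay inlined.
def splitExpressionsWithCurrentInputs(
    expressions: Seq[String],
    funcName: String,
    extraArguments: Seq[(String, String)] = Nil): String = {
  if (INPUT_ROW != null && currentVars == null) {
    splitExpressions(expressions, funcName, ("InternalRow", INPUT_ROW) +: extraArguments)
  } else {
    expressions.mkString("\n")
  }
}
```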


---




[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

2017-11-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19815#discussion_r153060413
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int 
precision) {
   @Override
   public UTF8String getUTF8String(int rowId) {
 if (dictionary == null) {
-  ColumnarArray a = getByteArray(rowId);
-  return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, 
a.length);
+  return childColumns[0].getUTF8String0(getArrayOffset(rowId), 
getArrayLength(rowId));
 } else {
   byte[] bytes = 
dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
   return UTF8String.fromBytes(bytes);
--- End diff --

hmm, but it looks like `decodeToBinary` will copy the byte data?


---




[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

2017-11-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19815#discussion_r153060336
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int 
precision) {
   @Override
   public UTF8String getUTF8String(int rowId) {
--- End diff --

Shall we add a comment that `getUTF8String` reuses the data in the column 
vector? It seems different from the other getXXX APIs.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19816
  
Let me maybe try to deal with this separately .. 


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19816
  
I tested 16, 32 and 64 too:

16:

```
Failed 
-
1. Failure: group by, agg functions (@test_sparkSQL.R#1839) 

30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - NA == NA


2. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) 
---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652


3. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) 
---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652


4. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) 
---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652


5. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) 
---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0004964011
Component “R”: Mean relative difference: 0.08695652


DONE 
===
```


32:

```
Failed 
-
1. Failure: spark.als (@test_mllib_recommendation.R#36) 

predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
3/3 mismatches (average diff: 2.75)
[1]  2.626 - -0.138 ==  2.76
[2] -1.502 -  2.626 == -4.13
[3] -0.138 - -1.502 ==  1.36


2. Failure: group by, agg functions (@test_sparkSQL.R#1839) 

30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - 19 == 11


3. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) 
---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625


4. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) 
---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625


5. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) 
---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625


6. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) 
---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0009925558
Component “R”: Mean relative difference: 0.625


DONE 
===
```

64:

```
Failed 
-
1. Failure: spark.als (@test_mllib_recommendation.R#36) 

predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
2/3 mismatches (average diff: 1.36)
[1] -1.502 - -0.138 == -1.36
[3] -0.138 - -1.502 ==  1.36


2. Failure: group by, agg functions (@test_sparkSQL.R#1839) 

30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - 19 == 11


3. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) 
---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625


4. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) 
---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625


5. Failure: pivot GroupedData column 
```

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19783
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84190/
Test PASSed.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19783
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19783
  
**[Test build #84190 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)**
 for PR 19783 at commit 
[`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19788: [SPARK-9853][Core] Optimize shuffle fetch of contiguous ...

2017-11-25 Thread yucai
Github user yucai commented on the issue:

https://github.com/apache/spark/pull/19788
  
Currently users need to update their external shuffle service for this 
feature, because we change the format of `ShuffleBlockId`, which is supposed to 
be parsed by the external shuffle service.
I am trying to introduce a new configuration like 
`spark.shuffle.continuousFetch`. By default it is `false` and Spark will still 
use `ShuffleBlockId` as always; when it is set to `true` intentionally, Spark 
will use `ContinuousShuffleBlockIds`. This way, users do not need to update their 
external shuffle service if they only want to work with `ShuffleBlockId`.
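
A hedged sketch of that gating (the config name is the one proposed above; the surrounding code is illustrative, not the actual patch, and `ContinuousShuffleBlockId` is the hypothetical new id type):

```Scala
// Illustration: keep the old block id format unless the new behavior is
// explicitly enabled, so already-deployed external shuffle services that
// only understand ShuffleBlockId keep working.
val continuousFetchEnabled =
  conf.getBoolean("spark.shuffle.continuousFetch", defaultValue = false)

val blockId = if (continuousFetchEnabled) {
  ContinuousShuffleBlockId(shuffleId, mapId, reduceId, numBlocks)
} else {
  ShuffleBlockId(shuffleId, mapId, reduceId)
}
```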


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-11-25 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r153057489
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
 ---
@@ -165,13 +165,23 @@ public ManagedBuffer getBlockData(
   String execId,
   int shuffleId,
   int mapId,
-  int reduceId) {
+  int reduceId,
+  int length) {
 ExecutorShuffleInfo executor = executors.get(new AppExecId(appId, 
execId));
 if (executor == null) {
   throw new RuntimeException(
 String.format("Executor is not registered (appId=%s, execId=%s)", 
appId, execId));
 }
-return getSortBasedShuffleBlockData(executor, shuffleId, mapId, 
reduceId);
+return getSortBasedShuffleBlockData(executor, shuffleId, mapId, 
reduceId, length);
+  }
+
+  public ManagedBuffer getBlockData(
--- End diff --

Thanks, will update.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19816
  
Ah, makes sense. Let me at least try other numbers and be back soon anyway.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19783
  
**[Test build #84190 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)**
 for PR 19783 at commit 
[`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1).


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19754
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19754
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84189/
Test PASSed.


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19754
  
**[Test build #84189 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84189/testReport)**
 for PR 19754 at commit 
[`71ec15b`](https://github.com/apache/spark/commit/71ec15b698af18093143ea5f3fa88bf86dc26b4e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19754
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84188/
Test PASSed.


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19754
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19754
  
**[Test build #84188 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84188/testReport)**
 for PR 19754 at commit 
[`41f4493`](https://github.com/apache/spark/commit/41f449354a91e8e10e8c39f63a2542eb6193).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19820
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19820
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84187/
Test PASSed.


---




[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19820
  
**[Test build #84187 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84187/testReport)**
 for PR 19820 at commit 
[`a0b6658`](https://github.com/apache/spark/commit/a0b6658a65d98158a8bf610a57edb240b99c8839).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84186/
Test PASSed.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84186 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84186/testReport)**
 for PR 19811 at commit 
[`d01fcb1`](https://github.com/apache/spark/commit/d01fcb10f877fdc5635acaedc2676e5e3d18c772).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19754: [BUILD] update release scripts

2017-11-25 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19754#discussion_r153052719
  
--- Diff: dev/create-release/release-build.sh ---
@@ -392,6 +375,7 @@ if [[ "$1" == "publish-release" ]]; then
   find . -type f |grep -v \.jar |grep -v \.pom | xargs rm
 
   echo "Creating hash and signature files"
+  # this must have .asc, .md5 and .sha1 - it really doesn't like anything 
else there
--- End diff --

it rejects the upload when there are "extra files" with the .sha512 extension.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19816
  
I think this is likely a general problem with the default number of shuffle 
partitions being too big for the limited resources in tests (particularly on 
Windows/AppVeyor). The additional test failures are expected, I guess, since 
fewer partitions can alter the results.

Let's do a more isolated change for now? But I agree new tests can cause 
problems in the future. Perhaps we could identify a number that is smaller than 
the default, gets "close enough" to the results we have, and set that for 
tests by default (16? 32? 64?).



---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19754
  
**[Test build #84189 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84189/testReport)**
 for PR 19754 at commit 
[`71ec15b`](https://github.com/apache/spark/commit/71ec15b698af18093143ea5f3fa88bf86dc26b4e).


---




[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19754
  
**[Test build #84188 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84188/testReport)**
 for PR 19754 at commit 
[`41f4493`](https://github.com/apache/spark/commit/41f449354a91e8e10e8c39f63a2542eb6193).


---




[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19820
  
**[Test build #84187 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84187/testReport)**
 for PR 19820 at commit 
[`a0b6658`](https://github.com/apache/spark/commit/a0b6658a65d98158a8bf610a57edb240b99c8839).


---




[GitHub] spark pull request #19820: [SPARK-22607][BUILD] Set large stack size consist...

2017-11-25 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/19820

[SPARK-22607][BUILD] Set large stack size consistently for tests to avoid 
StackOverflowError

## What changes were proposed in this pull request?

Set `-ea` and `-Xss4m` consistently for tests

## How was this patch tested?

Existing tests. Manually verified it resolves the StackOverflowError this 
intends to resolve.
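
For reference, the sbt side of such a change might look like this (a sketch under the assumption that tests are forked; the actual patch also has to touch the Maven test plugins):

```Scala
// Hedged sketch (sbt build definition): enable assertions and a 4 MB thread
// stack for all forked test JVMs.
javaOptions in Test ++= Seq("-ea", "-Xss4m")
```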

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-22607

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19820.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19820


commit a0b6658a65d98158a8bf610a57edb240b99c8839
Author: Sean Owen 
Date:   2017-11-25T18:25:29Z

Set -ea and -Xss4m consistently for tests




---




[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

2017-11-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19815
  
I will take a look this Sunday.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19811
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84185/
Test PASSed.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84185 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84185/testReport)**
 for PR 19811 at commit 
[`d8a9f9e`](https://github.com/apache/spark/commit/d8a9f9e91d54353b19eec0fec585b3ae5143eb34).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84186 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84186/testReport)**
 for PR 19811 at commit 
[`d01fcb1`](https://github.com/apache/spark/commit/d01fcb10f877fdc5635acaedc2676e5e3d18c772).


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-11-25 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r153049711
  
--- Diff: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolverSuite.java
 ---
@@ -110,6 +110,13 @@ public void testSortShuffleBlocks() throws IOException 
{
 new InputStreamReader(block1Stream, StandardCharsets.UTF_8));
 block1Stream.close();
 assertEquals(sortBlock1, block1);
+
+InputStream block01Stream =
+resolver.getBlockData("app0", "exec0", 0, 0, 0, 
2).createInputStream();
+String block01 = CharStreams.toString(
+new InputStreamReader(block01Stream, StandardCharsets.UTF_8));
--- End diff --

Thanks, updated!


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-11-25 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r153049707
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockId.scala ---
@@ -52,8 +52,9 @@ case class RDDBlockId(rddId: Int, splitIndex: Int) 
extends BlockId {
 // Format of the shuffle block ids (including data and index) should be 
kept in sync with
 // 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getBlockData().
 @DeveloperApi
-case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) 
extends BlockId {
-  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + 
reduceId
+case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, 
length: Int = 1)
+  extends BlockId {
+  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + 
reduceId + "_" + length
--- End diff --

`ContinuousShuffleBlockIds` looks like a good idea, let me try.
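
A minimal sketch of what a separate id could look like (the class shape is an assumption; the point is that the existing `ShuffleBlockId` format stays untouched):

```Scala
// Hedged sketch: a dedicated id for a contiguous range of reduce partitions,
// instead of adding a default-valued length field to ShuffleBlockId.
@DeveloperApi
case class ContinuousShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, numBlocks: Int)
  extends BlockId {
  override def name: String =
    "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + "_" + numBlocks
}
```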


---




[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...

2017-11-25 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19788#discussion_r153049650
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala ---
@@ -196,12 +196,14 @@ private[spark] class IndexShuffleBlockResolver(
   override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
 // The block is actually going to be a range of a single map output 
file for this map, so
 // find out the consolidated file, then the offset within that from 
our index
+logDebug(s"Fetch block data for $blockId")
--- End diff --

Without this info, it is hard to know whether a continuous shuffle block read 
really happened, and I found `getLocalBytes` has similar debug info too.
```
logDebug(s"Getting local block $blockId as bytes")
```
How about keeping it?


---




[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84185 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84185/testReport)**
 for PR 19811 at commit 
[`d8a9f9e`](https://github.com/apache/spark/commit/d8a9f9e91d54353b19eec0fec585b3ae5143eb34).


---




[GitHub] spark pull request #19798: [SPARK-22583] First delegation token renewal time...

2017-11-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19798


---




[GitHub] spark issue #19798: [SPARK-22583] First delegation token renewal time is not...

2017-11-25 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19798
  
Merged to master


---




[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-11-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19518
  
> I think ldc is 2 bytes and ldc_w is 3 bytes?

You are right, thanks, updated.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19816
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19816
  
**[Test build #84184 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84184/testReport)**
 for PR 19816 at commit 
[`41ffb54`](https://github.com/apache/spark/commit/41ffb54d391009e9f34b92af1f7686ff544e8b46).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19816
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84184/
Test PASSed.


---




[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19819
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84183/
Test PASSed.


---




[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19819
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19819
  
**[Test build #84183 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84183/testReport)**
 for PR 19819 at commit 
[`aa02d89`](https://github.com/apache/spark/commit/aa02d8904fcbaa91df47ac224d90345bd555a372).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-11-25 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/18906
  
Thanks for the background Bryan :) So it sounds like, from an API 
perspective, it makes sense to support this in the future, possibly on the Pandas 
UDFs (but the code isn't there on the JVM side). I'd say if @ptkool has the 
time, it might make sense to match the Scala API on the current UDFs, so it's easier 
when we want to add this to the Pandas UDFs.


---




[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19816
  
**[Test build #84184 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84184/testReport)**
 for PR 19816 at commit 
[`41ffb54`](https://github.com/apache/spark/commit/41ffb54d391009e9f34b92af1f7686ff544e8b46).


---




[GitHub] spark issue #19816: [SPARK-21693][FOLLOWUP][R] Reduce shuffle partitions run...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19816
  
Not sure. Let me know if you have a preference @felixcheung.


---




[GitHub] spark issue #19816: [SPARK-21693][FOLLOWUP][R] Reduce shuffle partitions run...

2017-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19816
  
@felixcheung, I just tried to lower this by default and ran the tests. It 
seems some tests fail. For example, if we lower `spark.sql.shuffle.partitions` 
to 5, these fail additionally:

```
Failed 
-
1. Failure: spark.als (@test_mllib_recommendation.R#36) 

predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
3/3 mismatches (average diff: 2.75)
[1]  2.626 - -0.138 ==  2.76
[2] -1.502 -  2.626 == -4.13
[3] -0.138 - -1.502 ==  1.36


2. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) 
---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545


3. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) 
---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545


4. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) 
---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545


5. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) 
---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0004961548
Component “R”: Mean relative difference: 0.5454545
```
 
Shuffle + R worker cases do not look very frequent (to be clear, just shuffle 
without an R worker should be fine IIUC). 

I don't have a strong opinion on lowering: if we don't lower it, 
some tests in the future could cause such a problem again; if we do lower it, 
the required change looks quite large and this case might not be very 
frequent. 


---




[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19819
  
**[Test build #84183 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84183/testReport)**
 for PR 19819 at commit 
[`aa02d89`](https://github.com/apache/spark/commit/aa02d8904fcbaa91df47ac224d90345bd555a372).


---




[GitHub] spark pull request #19819: [SPARK-22606][Streaming]Add threadId to the Cache...

2017-11-25 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/19819

[SPARK-22606][Streaming]Add threadId to the CachedKafkaConsumer key

## What changes were proposed in this pull request?
If the value of the param 'spark.streaming.concurrentJobs' is more than one, 
and the value of the param 'spark.executor.cores' is more than one, then two or 
more tasks in one executor may use the same Kafka consumer at the same 
time, which throws an exception: "KafkaConsumer is not safe for 
multi-threaded access".
For example:
spark.streaming.concurrentJobs=2
spark.executor.cores=2
spark.cores.max=2
If there is only one topic with one partition ('topic1', 0) to consume, there 
will be two jobs running at the same time, and they will use the same 
cache key ('groupid', 'topic1', 0) to get the CachedKafkaConsumer from the cache 
list `private var cache: ju.LinkedHashMap[CacheKey, CachedKafkaConsumer[_, 
_]]`; they will then get the same CachedKafkaConsumer.

This PR adds the threadId to the CachedKafkaConsumer key to prevent two threads 
from using the same consumer at the same time.
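
A hedged sketch of the keying change just described (field names are assumptions based on this description):

```Scala
// Illustration: including the running thread's id in the cache key means two
// concurrently running tasks in the same executor resolve to different cache
// entries, so they never share one (non-thread-safe) KafkaConsumer.
case class CacheKey(threadId: Long, groupId: String, topic: String, partition: Int)

// At lookup time, something like:
// val key = CacheKey(Thread.currentThread().getId, groupId, topic, partition)
```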



## How was this patch tested?
Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark kafka

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19819


commit aa02d8904fcbaa91df47ac224d90345bd555a372
Author: 10129659 
Date:   2017-11-25T08:15:17Z

Add threadId to CachedKafkaConsumer key




---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84182 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84182/testReport)**
 for PR 19813 at commit 
[`65d07d5`](https://github.com/apache/spark/commit/65d07d525344e1d00457d2f538b2ef0b1c38a8e8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84182/
Test FAILed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-11-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19813
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-11-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19813
  
> However for whole stage codegen path, it's way more complex:
> 1. some of ctx.currentVars are just variables, their codes have already 
been generated before. But some are not. For those whose codes are not 
generated, they are not valid inputs.
> 2. ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is 
not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs.


Yes, this is correct.

So, for 1, we don't include the variables that are not evaluated yet as 
parameters.
For 2, null slots in ctx.currentVars won't be included as parameters either; 
ctx.INPUT_ROW will be included only if it is not null.
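
A shape-only sketch of that selection logic (assumes Spark's convention that an `ExprCode` with an empty `code` string is already evaluated; the parameter types are simplified):

```Scala
// Hedged sketch: collect only the valid inputs for a split function.
def validParameters(ctx: CodegenContext): Seq[(String, String)] = {
  val rowInput =
    if (ctx.INPUT_ROW != null) Seq(("InternalRow", ctx.INPUT_ROW)) else Nil
  val varInputs =
    if (ctx.currentVars == null) Nil
    else ctx.currentVars
      .filter(v => v != null && v.code.isEmpty)  // skip null slots and unevaluated vars
      .map(v => ("Object", v.value))             // real code would use the Java type
  rowInput ++ varInputs
}
```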


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2017-11-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19813
  
**[Test build #84182 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84182/testReport)**
 for PR 19813 at commit 
[`65d07d5`](https://github.com/apache/spark/commit/65d07d525344e1d00457d2f538b2ef0b1c38a8e8).


---




[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-11-25 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r153040576
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -236,4 +237,22 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("SPARK-22551: Fix 64kb limit for deeply nested expressions under 
wholestage codegen") {
+import testImplicits._
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  val df = Seq(("abc", 1)).toDF("key", "int")
+  df.write.parquet(path)
+
+  var strExpr: Expression = col("key").expr
+  for (_ <- 1 to 150) {
+strExpr = Decode(Encode(strExpr, Literal("utf-8")), 
Literal("utf-8"))
+  }
+  val expressions = Seq(If(EqualTo(strExpr, strExpr), strExpr, 
strExpr))
+
+  val df2 = 
spark.read.parquet(path).select(expressions.map(Column(_)): _*)
--- End diff --

Ok. I'll add an assert to make sure of this.
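
Something along these lines (a sketch; the exact check used in the test may differ):

```Scala
// Hedged sketch: assert that the query actually runs with whole-stage
// codegen before checking its result.
val plan = df2.queryExecution.executedPlan
assert(plan.find(_.isInstanceOf[WholeStageCodegenExec]).isDefined)
```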


---




[GitHub] spark issue #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem w...

2017-11-25 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19817
  
LGTM


---
