[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19769
  
retest this please


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19769
  
LGTM except a few minor comments


---




[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19774
  
**[Test build #83966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83966/testReport)** for PR 19774 at commit [`24bfcb1`](https://github.com/apache/spark/commit/24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a).


---




[GitHub] spark issue #19763: [SPARK-22537][core] Aggregation of map output statistics...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19763
  
cc @zsxwing 


---




[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19767#discussion_r151637511
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
+           |  $setIsNull
+           |}
+         """.stripMargin)
+
+      ve.code = s"$funcFullName(${ctx.INPUT_ROW});"
+    }
+
     if (ve.code.nonEmpty) {
       // Add `this` in the comment.
       ve.copy(code = s"${ctx.registerComment(this.toString)}\n" + ve.code.trim)
--- End diff --

I don't have a strong preference; it's OK to have the comment at the function caller side.


---




[GitHub] spark issue #19773: Supporting for changing column dataType

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19773
  
**[Test build #83964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83964/testReport)** for PR 19773 at commit [`1bcd74f`](https://github.com/apache/spark/commit/1bcd74fae9cb6595e04eab6ecaf621739644102f).


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151672855
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -427,10 +444,10 @@ trait MesosSchedulerUtils extends Logging {
       // partition port offers
       val (resourcesWithoutPorts, portResources) = filterPortResources(offeredResources)
 
-      val portsAndRoles = requestedPorts.
-        map(x => (x, findPortAndGetAssignedRangeRole(x, portResources)))
+      val portsAndResourceInfo = requestedPorts.
+        map(x => (x, findPortAndGetAssignedResourceInfo(x, portResources)))
--- End diff --

OK, will fix, no problem.


---




[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19767#discussion_r151624776
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
+           |  $setIsNull
--- End diff --

yeah, it's already done when defining `setIsNull`


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r151633919
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
           Row(Seq(2, 3
       }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
--- End diff --

nit: `.map(java.sql.Timestamp.valueOf)`
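
A minimal, REPL-runnable illustration of the nit — the eta-expanded method value is equivalent to the explicit lambda (sample strings are illustrative only):

```scala
val dates = Seq("2001-01-01 01:01:01", "2002-02-02 02:02:02")

// explicit lambda, as written in the patch
val a = dates.map { s => java.sql.Timestamp.valueOf(s) }
// method value, as suggested above
val b = dates.map(java.sql.Timestamp.valueOf)

assert(a == b)  // same result either way
```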


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83959/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83959/
Test FAILed.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151674490
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging {
   }
 
   /** Creates a mesos resource for a specific port number. */
-  private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = {
-    portsAndRoles.flatMap{ case (port, role) =>
-      createMesosPortResource(List((port, port)), Some(role))}
+  private def createResourcesFromPorts(
+      portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))])
+    : List[Resource] = {
--- End diff --

ok


---




[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19774

[SPARK-22475][SQL] show histogram in DESC COLUMN command

## What changes were proposed in this pull request?

Added the histogram representation to the output of the `DESCRIBE EXTENDED 
table_name column_name` command.

## How was this patch tested?

Modified SQL UT and checked output

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-22475

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19774


commit 24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a
Author: Marco Gaido 
Date:   2017-11-17T12:42:16Z

[SPARK-22475][SQL] show histogram in DESC COLUMN command




---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-17 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151676061
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2049,132 +2050,12 @@ def map_values(col):
 
 # ---------------- User Defined Function ----------------------------------
 
-def _wrap_function(sc, func, returnType):
-    command = (func, returnType)
-    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
-    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
-                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
-
-
-class PythonUdfType(object):
-    # row-at-a-time UDFs
-    NORMAL_UDF = 0
-    # scalar vectorized UDFs
-    PANDAS_UDF = 1
-    # grouped vectorized UDFs
-    PANDAS_GROUPED_UDF = 2
-
-
-class UserDefinedFunction(object):
--- End diff --

So moving this will probably break some people's code.


---




[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...

2017-11-17 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19767
  
Looks like a good direction if we do not see performance degradation.


---




[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...

2017-11-17 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/9428
  
That's the reason why I want to checkpoint when they are first calculated. 
Further transformations use these results several times. Of course it's not a 
problem per se to calculate twice for the checkpoint, but doing so for 1+TB of 
data is nonsense and I can't cache.


---




[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19773
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19630
  
retest this please


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r151633640
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
@@ -151,6 +154,8 @@ private[parquet] class ParquetRowConverter(
            |${catalystType.prettyJson}
          """.stripMargin)
 
+  val UTC = DateTimeUtils.TimeZoneUTC
--- End diff --

nit: `private val`?


---




[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...

2017-11-17 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19767#discussion_r151644789
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
    val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Thanks.

IMHO, I am curious whether we would see any performance degradation from using one array to compact many boolean variables. I am waiting for the updated result in [this discussion](https://github.com/apache/spark/pull/19518#issuecomment-337965330), because the current code seems to measure the performance of the interpreter due to lack of warmup.



---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19769
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83962/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19769
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83960/
Test PASSed.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19769
  
**[Test build #83960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83960/testReport)** for PR 19769 at commit [`953b4e8`](https://github.com/apache/spark/commit/953b4e84b717962316218aec0d635f344b44134c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19773
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83964/
Test FAILed.


---




[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19773
  
**[Test build #83964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83964/testReport)** for PR 19773 at commit [`1bcd74f`](https://github.com/apache/spark/commit/1bcd74fae9cb6595e04eab6ecaf621739644102f).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19769
  
**[Test build #83960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83960/testReport)** for PR 19769 at commit [`953b4e8`](https://github.com/apache/spark/commit/953b4e84b717962316218aec0d635f344b44134c).


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r151634925
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
           Row(Seq(2, 3
       }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+    val impalaPath = Thread.currentThread().getContextClassLoader.getResource(impalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      import testImplicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { int96TimestampConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+              (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString()),
+              (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
--- End diff --

to be future-proof, let's explicitly set `PARQUET_OUTPUT_TIMESTAMP_TYPE=INT96`
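
A sketch of how that could look in the quoted `withSQLConf` block; `SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE` and the `"INT96"` value are assumed here, and the rest of the test body is unchanged:

```scala
withSQLConf(
    (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString),
    (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString),
    // pin the writer to int96 so the test keeps exercising this path
    // even if the default output timestamp type changes later
    (SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key, "INT96")) {
  // ... existing read-back assertions ...
}
```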


---




[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19767#discussion_r151636953
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Good suggestion! Actually, this is a general strategy that can be applied in more places. If there are only boolean global variables, it's very easy to fold them into one array.


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151672564
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -175,17 +176,39 @@ trait MesosSchedulerUtils extends Logging {
     registerLatch.countDown()
   }
 
-  def createResource(name: String, amount: Double, role: Option[String] = None): Resource = {
+  private def setAllocationAndReservationInfo(
+      allocationInfo: Option[AllocationInfo],
+      reservationInfo: Option[ReservationInfo],
+      role: Option[String],
+      builder: Resource.Builder): Unit = {
+    if (role.forall(r => !r.equals(ANY_ROLE))) {
--- End diff --

Even better: `!role.contains(ANY_ROLE)`
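
A quick standalone check that the two predicates agree for every `Option[String]` shape; `ANY_ROLE = "*"` is an assumption here, not taken from the diff:

```scala
object RoleCheckEquivalence {
  val ANY_ROLE = "*"  // assumed wildcard role

  def main(args: Array[String]): Unit = {
    for (role <- Seq(None, Some("*"), Some("spark"))) {
      val viaForall   = role.forall(r => !r.equals(ANY_ROLE))  // current form
      val viaContains = !role.contains(ANY_ROLE)               // suggested form
      assert(viaForall == viaContains)
      println(s"$role -> $viaContains")
    }
  }
}
```

Both treat `None` as "not the any-role" (true), so the rewrite is purely cosmetic.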


---




[GitHub] spark pull request #19773: Supporting for changing column dataType

2017-11-17 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/19773

Supporting for changing column dataType

## What changes were proposed in this pull request?

Support users changing a column's dataType in Hive tables and datasource tables. This PR also aims to open a further discussion about other DDL requirements.

## How was this patch tested?

Add test case in `DDLSuite.scala` and `SQLQueryTestSuite.scala`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-22546

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19773.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19773


commit 1bcd74fae9cb6595e04eab6ecaf621739644102f
Author: Yuanjian Li 
Date:   2017-11-17T12:11:33Z

Support change column dataType




---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19257
  
@felixcheung Don't worry, the bug only exists in the master branch, so it won't block the 2.2.1 release. I have corrected the JIRA ticket's affected version to 2.3. Also, I'm looking into this issue.


---




[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...

2017-11-17 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19767#discussion_r151631456
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Can we always pass `value` back as a return value? It can reduce the number of global variables.
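
A standalone sketch of the suggestion — illustrative strings only, not the PR's `CodegenContext` calls: give the split function a return type so `value` comes back as a return value and only `isNull` still needs a global field:

```scala
object ReturnValueInsteadOfGlobal {
  def main(args: Array[String]): Unit = {
    val javaType = "int"     // stands in for ctx.javaType(dataType)
    val funcName = "expr_0"  // stands in for ctx.freshName(nodeName)
    val body = "int result = i.getInt(0) + 1;"  // stands in for ve.code

    val function =
      s"""
         |private $javaType $funcName(InternalRow i) {
         |  $body
         |  globalIsNull = (result == 0);  // isNull still needs a field
         |  return result;                 // value no longer does
         |}
       """.stripMargin

    println(function)
    println(s"caller side: $javaType value_0 = $funcName(i);")
  }
}
```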



---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r151634730
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
           Row(Seq(2, 3
       }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+    val impalaPath = Thread.currentThread().getContextClassLoader.getResource(impalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      import testImplicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { int96TimestampConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+              (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString()),
+              (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
+          ) {
+            val readBack = spark.read.parquet(tableDir.getAbsolutePath).collect()
+            assert(readBack.size === 6)
+            // if we apply the conversion, we'll get the "right" values, as saved by impala in the
+            // original file.  Otherwise, they're off by the local timezone offset, set to
+            // America/Los_Angeles in tests
+            val impalaExpectations = if (int96TimestampConversion) {
+              impalaFileData
+            } else {
+              impalaFileData.map { ts =>
+                DateTimeUtils.toJavaTimestamp(DateTimeUtils.convertTz(
+                  DateTimeUtils.fromJavaTimestamp(ts),
+                  DateTimeUtils.TimeZoneUTC,
+                  DateTimeUtils.getTimeZone(conf.sessionLocalTimeZone)))
+              }
+            }
+            val fullExpectations = (ts ++ impalaExpectations).map(_.toString).sorted.toArray
+            val actual = readBack.map(_.getTimestamp(0).toString).sorted
+            withClue(s"applyConversion = $int96TimestampConversion; vectorized = $vectorized") {
+              assert(fullExpectations === actual)
+
+              // Now test that the behavior is still correct even with a filter which could get
+              // pushed down into parquet.  We don't need extra handling for pushed down
+              // predicates because (a) in ParquetFilters, we ignore TimestampType and (b) parquet
+              // does not read statistics from int96 fields, as they are unsigned.  See
+              // scalastyle:off line.size.limit
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L419
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L348
+              // scalastyle:on line.size.limit
+              //
+              // Just to be defensive in case anything ever changes in parquet, this test checks
+              // the assumption on column stats, and also the end-to-end behavior.
+
+              val hadoopConf = sparkContext.hadoopConfiguration
+              val fs = FileSystem.get(hadoopConf)
+              val parts = fs.listStatus(new Path(tableDir.getAbsolutePath), new PathFilter {
+                override def accept(path: Path): Boolean = !path.getName.startsWith("_")
+              })
+              // grab the meta data from the parquet file.  The next section of asserts just make
+              // sure the test is configured correctly.
+              assert(parts.size == 2)

[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19518
  
ping @bdrillard 


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19630
  
retest this please


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151674372
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -228,24 +254,15 @@ trait MesosSchedulerUtils extends Logging {
     (attr.getName, attr.getText.getValue.split(',').toSet)
   }
 
-
-  /** Build a Mesos resource protobuf object */
-  protected def createResource(resourceName: String, quantity: Double): Protos.Resource = {
-    Resource.newBuilder()
-      .setName(resourceName)
-      .setType(Value.Type.SCALAR)
-      .setScalar(Value.Scalar.newBuilder().setValue(quantity).build())
-      .build()
-  }
-
   /**
    * Converts the attributes from the resource offer into a Map of name to Attribute Value
    * The attribute values are the mesos attribute types and they are
    *
    * @param offerAttributes the attributes offered
    * @return
    */
-  protected def toAttributeMap(offerAttributes: JList[Attribute]): Map[String, GeneratedMessage] = {
+  protected def toAttributeMap(offerAttributes: JList[Attribute])
+    : Map[String, GeneratedMessageV3] = {
--- End diff --

Is a 2-space indent not correct?


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83965/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151679164
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging {
   }
 
   /** Creates a mesos resource for a specific port number. */
-  private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = {
-    portsAndRoles.flatMap{ case (port, role) =>
-      createMesosPortResource(List((port, port)), Some(role))}
+  private def createResourcesFromPorts(
+      portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))])
--- End diff --

yeah makes sense.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83961/
Test FAILed.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19767
  
**[Test build #83963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83963/testReport)** for PR 19767 at commit [`3dab5bd`](https://github.com/apache/spark/commit/3dab5bdbc4d2bb1818c46905afb92422bac04d9e).


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151673115
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -349,13 +349,22 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
       val offerMem = getResource(offer.getResourcesList, "mem")
       val offerCpus = getResource(offer.getResourcesList, "cpus")
       val offerPorts = getRangeResource(offer.getResourcesList, "ports")
+      val offerAllocationInfo = offer.getAllocationInfo
+      val offerReservationInfo = offer
+        .getResourcesList
+        .asScala
+        .find(resource => Option(resource.getReservation).isDefined)
--- End diff --

ok


---




[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...

2017-11-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/19390#discussion_r151673131
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging {
   }
 
   /** Creates a mesos resource for a specific port number. */
-  private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = {
-    portsAndRoles.flatMap{ case (port, role) =>
-      createMesosPortResource(List((port, port)), Some(role))}
+  private def createResourcesFromPorts(
+      portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))])
+    : List[Resource] = {
+    portsAndResourcesInfo.flatMap { case (port, rInfo) =>
--- End diff --

ok


---




[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19767
  
@maropu it partially covers #18641. One problem is that, for an expression, if each of its children generates less than 1024 characters of code but it has many children, then we still have an issue. `CaseWhen` is a little different because it can have at most 20 children (depending on `spark.sql.codegen.maxCaseBranches`). So we can still prevent compile failures, but may not be able to JIT.


---




[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...

2017-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19769#discussion_r151635947
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +96,113 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
           Row(Seq(2, 3
       }
   }
+
+  val ImpalaFile = "test-data/impala_timestamp.parq"
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+
+    // here's the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+    val impalaFile = Thread.currentThread().getContextClassLoader.getResource(ImpalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      val s = spark
+      import s.implicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      val schema = df.schema
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaFile), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { applyConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+              (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, applyConversion.toString()),
+              (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
+          ) {
+            val read = spark.read.parquet(tableDir.getAbsolutePath).collect()
+            assert(read.size === 6)
+            // if we apply the conversion, we'll get the "right" values, as saved by impala in the
+            // original file.  Otherwise, they're off by the local timezone offset, set to
+            // America/Los_Angeles in tests
+            val impalaExpectations = if (applyConversion) {
+              impalaFileData
+            } else {
+              impalaFileData.map { ts =>
+                DateTimeUtils.toJavaTimestamp(DateTimeUtils.convertTz(
+                  DateTimeUtils.fromJavaTimestamp(ts),
+                  TimeZone.getTimeZone("UTC"),
+                  TimeZone.getDefault()))
+              }
+            }
+            val fullExpectations = (ts ++ impalaExpectations).map {
+              _.toString()
+            }.sorted.toArray
+            val actual = read.map {
+              _.getTimestamp(0).toString()
+            }.sorted
+            withClue(s"applyConversion = $applyConversion; vectorized = $vectorized") {
+              assert(fullExpectations === actual)
+
+              // Now test that the behavior is still correct even with a filter which could get
+              // pushed down into parquet.  We don't need extra handling for pushed down
+              // predicates because (a) in ParquetFilters, we ignore TimestampType and (b) parquet
+              // does not read statistics from int96 fields, as they are unsigned.  See
+              // scalastyle:off line.size.limit
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L419
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L348
+              // scalastyle:on line.size.limit
+              //
+              // Just to be defensive in case anything ever changes in parquet, this test checks
+              // the assumption on column stats, and also the end-to-end behavior.
+
+              val hadoopConf = sparkContext.hadoopConfiguration
+              val fs = FileSystem.get(hadoopConf)
+              val parts = fs.listStatus(new Path(tableDir.getAbsolutePath), new PathFilter {
+                override def accept(path: Path): Boolean = !path.getName.startsWith("_")
+              })
+              // grab the meta data from the parquet file.  The next section of asserts just make
+              // sure the test is configured

[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151677913
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2049,132 +2050,12 @@ def map_values(col):
 
 # ---------------- User Defined Function ----------------------------------
 
-def _wrap_function(sc, func, returnType):
-    command = (func, returnType)
-    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
-    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
-                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
-
-
-class PythonUdfType(object):
-    # row-at-a-time UDFs
-    NORMAL_UDF = 0
-    # scalar vectorized UDFs
-    PANDAS_UDF = 1
-    # grouped vectorized UDFs
-    PANDAS_GROUPED_UDF = 2
-
-
-class UserDefinedFunction(object):
--- End diff --

Yup, I noticed it at first too when I reviewed, but then noticed he imported this intentionally:

https://github.com/icexelloss/spark/blob/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3/python/pyspark/sql/functions.py#L35

So I guess it could be fine. I just manually double-checked:

```python
>>> from pyspark.sql import functions
>>> functions.UserDefinedFunction

>>> from pyspark import sql
>>> sql.functions.UserDefinedFunction

>>> from pyspark.sql.functions import UserDefinedFunction
>>> from pyspark.sql.udf import UserDefinedFunction
```


---




[GitHub] spark issue #19739: [SPARK-22513][BUILD] Provide build profile for hadoop 2....

2017-11-17 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19739
  
In any event, you can always produce a build that does exactly this, without any POM changes, with `-Dhadoop.version=2.8.2` if you wanted to. You can close this.


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
I have taken care of adding the check in the new 
HiveClientImpl.alterTableDataSchema as well and have added some new tests. 


---




[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19760
  
I'm OK to move the deprecated config keys of `MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM` and `LISTENER_BUS_EVENT_QUEUE_CAPACITY` to `SparkConf` if the deprecation message really matters. But I'd like to keep `withAlternatives`. Generally it's a better interface, and my future plan is to move config-related stuff to a new maven module so it can be used in modules that don't depend on the core module (e.g. the network module). It would be annoying if every time we want to deprecate a conf we need to change the config module.


---




[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83962/
Test PASSed.


---




[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19774
  
**[Test build #83966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83966/testReport)** for PR 19774 at commit [`24bfcb1`](https://github.com/apache/spark/commit/24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19774
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19775: Add support for publishing Spark metrics into Pro...

2017-11-17 Thread matyix
GitHub user matyix opened a pull request:

https://github.com/apache/spark/pull/19775

Add support for publishing Spark metrics into Prometheus

## What changes were proposed in this pull request?

_Originally this PR was submitted to the Spark on K8S fork [here](https://github.com/apache-spark-on-k8s/spark/pull/531), but we were advised by @erikerlandson and @foxish to resend it upstream. K8S-specific items were removed from the PR, and it has been reworked for the Apache version._

Publishing Spark metrics into Prometheus, as highlighted in the [JIRA](https://issues.apache.org/jira/browse/SPARK-22343). Implemented a metrics sink that publishes Spark metrics into Prometheus via the [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). Metrics data published by Spark is based on [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is not supported natively by Prometheus, thus these are converted using [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html) prior to pushing metrics to the pushgateway.

Also, the default Prometheus pushgateway client API implementation does not support metrics timestamps, thus the client API has been enhanced to enrich metrics data with timestamps.
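
For readers following along, a minimal standalone sketch of the Dropwizard-to-Pushgateway flow described above; the registry contents, job name, and gateway address are assumptions, not taken from the PR:

```scala
import com.codahale.metrics.MetricRegistry
import io.prometheus.client.CollectorRegistry
import io.prometheus.client.dropwizard.DropwizardExports
import io.prometheus.client.exporter.PushGateway

object PushgatewaySketch {
  def main(args: Array[String]): Unit = {
    // Dropwizard registry standing in for Spark's MetricsSystem registry.
    val dropwizard = new MetricRegistry()
    dropwizard.counter("jobs_completed").inc()

    // Bridge Dropwizard metrics into the Prometheus collector model.
    val prometheus = new CollectorRegistry()
    new DropwizardExports(dropwizard).register(prometheus)

    // Push the converted metrics to a Pushgateway for Prometheus to scrape.
    new PushGateway("localhost:9091").pushAdd(prometheus, "spark_app")
  }
}
```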

## How was this patch tested?

This PR does not affect the existing code base or alter existing functionality. Nevertheless, I have executed all `unit and integration` tests. This setup has also been deployed and monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1).

`Manual` testing was done by deploying a Spark cluster, Prometheus server, and Pushgateway, and running SparkPi.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/banzaicloud/spark 
apache_master_prometheus_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19775


commit 579cca96af187cf50fbedf5927cdea4e0bbdff26
Author: Janos Matyas 
Date:   2017-10-17T18:51:50Z

Add support for prometheus




---




[GitHub] spark issue #19775: Add support for publishing Spark metrics into Prometheus

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19775
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19765: [SPARK-22540][SQL] Ensure HighlyCompressedMapStat...

2017-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19765


---




[GitHub] spark pull request #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can bre...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19257#discussion_r151714611
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
@@ -602,6 +602,28 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils {
     )
   }
 
+  test("SPARK-22042 ReorderJoinPredicates can break when child's partitioning is not decided") {
+    withTable("bucketed_table", "table1", "table2") {
+      df.write.format("parquet").saveAsTable("table1")
+      df.write.format("parquet").saveAsTable("table2")
+      df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
+
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
+        sql("""
+          |SELECT *
+          |FROM (
+          |  SELECT a.i, a.j, a.k
+          |  FROM bucketed_table a
+          |  JOIN table1 b
+          |  ON a.i = b.i
+          |) c
+          |JOIN table2
+          |ON c.i = table2.i
+          |""".stripMargin).explain()
--- End diff --

Use `checkAnswer` instead of `explain` in the test.


---




[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19767
  
**[Test build #83963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83963/testReport)** for PR 19767 at commit [`3dab5bd`](https://github.com/apache/spark/commit/3dab5bdbc4d2bb1818c46905afb92422bac04d9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19769
  
**[Test build #83971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83971/testReport)** for PR 19769 at commit [`9bb4cf0`](https://github.com/apache/spark/commit/9bb4cf0514dddc005b90ddb17a22d3b05be929e5).


---




[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19767
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83962 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83962/testReport)**
 for PR 19630 at commit 
[`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/19630
  
Everyone, I don't have any more changes for the PR. I think all comments have 
been addressed at this point. Please let me know if I missed anything or if 
there are more comments. Thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19774
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83966/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19630
  
thanks, merging to master, cheers!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...

2017-11-17 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19769
  
cc @henryr @zivanfi


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem w...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19730
  
**[Test build #83970 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83970/testReport)**
 for PR 19730 at commit 
[`83fef40`](https://github.com/apache/spark/commit/83fef403b92a96a13421901d161a0df5e6a6d7b3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19774
  
**[Test build #83972 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83972/testReport)**
 for PR 19774 at commit 
[`9bfa80c`](https://github.com/apache/spark/commit/9bfa80cca04a3b00e0fc2b02beb45c56f2058a34).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/19630
  
@HyukjinKwon Thanks for the reply on coverage. It'd be great to have an 
easy way to run coverage :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19257
  
After some more thought, I think the best choice is to do planning bottom-up. 
That requires a lot of refactoring, so I'm fine with merging this workaround 
first.

LGTM except one minor comment for the test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151689883
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key;
 
 DESC FORMATTED desc_col_table key;
 
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values(1);
+INSERT INTO desc_col_table values(2);
+INSERT INTO desc_col_table values(3);
+INSERT INTO desc_col_table values(4);
--- End diff --

INSERT INTO desc_col_table VALUES (1), (2), (3), (4);


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...

2017-11-17 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19774#discussion_r151693478
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql ---
@@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key;
 
 DESC FORMATTED desc_col_table key;
 
+SET spark.sql.statistics.histogram.enabled=true;
+SET spark.sql.statistics.histogram.numBins=2;
+
+INSERT INTO desc_col_table values(1);
+INSERT INTO desc_col_table values(2);
+INSERT INTO desc_col_table values(3);
+INSERT INTO desc_col_table values(4);
+
+ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
--- End diff --

please set the SQL confs back to their default values.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19765: [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calc...

2017-11-17 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19765
  
Merged to master/2.2


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-17 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151689272
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   setupTestData()
 
+  test("SPARK-22431: table with nested type col with special char") {
--- End diff --

Thanks @gatorsmile for your comments. I have addressed them in the latest 
commit. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19747
  
**[Test build #83968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83968/testReport)**
 for PR 19747 at commit 
[`e5c2cf3`](https://github.com/apache/spark/commit/e5c2cf369912583b273ed573e3be4fdc5b9fb78d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19747
  
**[Test build #83969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83969/testReport)**
 for PR 19747 at commit 
[`3be7b47`](https://github.com/apache/spark/commit/3be7b4736c93c6171677f6488c5a623c2eb38ad9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19767
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83963/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83965 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83965/testReport)**
 for PR 19630 at commit 
[`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83965/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19630


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151765263
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala
 ---
@@ -260,6 +261,7 @@ object ParquetReadBenchmark {
   def stringWithNullsScanBenchmark(values: Int, fractionOfNulls: Double): 
Unit = {
 withTempPath { dir =>
   withTempTable("t1", "tempTable") {
+val enableOffHeapColumnVector = 
spark.sqlContext.conf.offHeapColumnVectorEnabled
--- End diff --

nit: spark.sessionState.conf.offHeapColumnVectorEnabled


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17436
  
LGTM except a few minor comments; please update the PR title and 
description, thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151764870
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -62,7 +69,11 @@ case class InMemoryTableScanExec(
 
   private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): 
ColumnarBatch = {
 val rowCount = cachedColumnarBatch.numRows
-val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, 
columnarBatchSchema)
+val columnVectors = if (!conf.offHeapColumnVectorEnabled) {
--- End diff --

only enable it when `TaskContext.get != null`?
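
A sketch of the guarded allocation, assuming `OffHeapColumnVector` exposes the 
same `allocateColumns` helper as its on-heap counterpart (`conf`, `rowCount`, 
and `columnarBatchSchema` come from the surrounding method in the diff):

```scala
// Only allocate off-heap vectors inside a running task, where a task memory
// manager exists to account for the off-heap allocations.
val useOffHeap = conf.offHeapColumnVectorEnabled && TaskContext.get() != null
val columnVectors = if (useOffHeap) {
  OffHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
} else {
  OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
}
```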


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19390
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19390
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83967/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19772#discussion_r151732579
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala ---
@@ -70,7 +70,8 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") 
override val uid: String)
 dataset.createOrReplaceTempView(tableName)
 val realStatement = $(statement).replace(tableIdentifier, tableName)
 val result = dataset.sparkSession.sql(realStatement)
-dataset.sparkSession.catalog.dropTempView(tableName)
--- End diff --

It seems like a bug: when you cache a DataFrame, create a view from that 
DataFrame, and then drop the view, Spark should not uncache the original 
DataFrame. We can discuss this more later.
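
A hypothetical repro of the scenario described above:

```scala
// The cached DataFrame should survive dropping a temp view derived from it.
val df = spark.range(10).toDF("x").cache()
df.createOrReplaceTempView("v")
spark.catalog.dropTempView("v")  // should drop the view, not uncache `df`
```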


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19775: [SPARK-22343][core] Add support for publishing Spark met...

2017-11-17 Thread erikerlandson
Github user erikerlandson commented on the issue:

https://github.com/apache/spark/pull/19775
  
@matyix thanks for re-submitting!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151764498
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -101,9 +101,13 @@
   private boolean returnColumnarBatch;
 
   /**
-   * The default config on whether columnarBatch should be offheap.
+   * The config on whether columnarBatch should be offheap.
--- End diff --

nit: the memory mode of the columnarBatch


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151764317
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -140,6 +140,13 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val COLUMN_VECTOR_OFFHEAP_ENABLED =
+buildConf("spark.sql.columnVector.offheap.enable")
+  .internal()
+  .doc("When true, use OffHeapColumnVector in ColumnarBatch.")
+  .booleanConf
+  .createWithDefault(true)
--- End diff --

Hey, let's not change the existing behavior.
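
i.e. keep the default as it was before this change; a sketch of the diff with 
only the default flipped back:

```scala
// Off-heap column vectors stay opt-in, so existing deployments see no change.
val COLUMN_VECTOR_OFFHEAP_ENABLED =
  buildConf("spark.sql.columnVector.offheap.enable")
    .internal()
    .doc("When true, use OffHeapColumnVector in ColumnarBatch.")
    .booleanConf
    .createWithDefault(false)
```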


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19728
  
Can you split this into 3 PRs? The approaches for these 3 expressions are 
quite different.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19388
  
**[Test build #83976 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83976/testReport)**
 for PR 19388 at commit 
[`500c73c`](https://github.com/apache/spark/commit/500c73cc96290efe0194e371ab84e0cda863347d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19730#discussion_r151723565
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -827,4 +827,34 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
 checkEvaluation(cast(Literal.create(input, from), to), input)
   }
+
+  test("SPARK-22500: cast for struct should not generate codes beyond 
64KB") {
+val N = 1000
+
+val from1 = new StructType(
+  (1 to N).map(i => StructField(s"s$i", StringType)).toArray)
+val to1 = new StructType(
+  (1 to N).map(i => StructField(s"i$i", IntegerType)).toArray)
+val input1 = Row.fromSeq((1 to N).map(i => i.toString))
+val output1 = Row.fromSeq((1 to N))
+checkEvaluation(cast(Literal.create(input1, from1), to1), output1)
+
+val from2 = new StructType(
+  (1 to N).map(i => StructField(s"a$i", ArrayType(StringType, 
containsNull = false))).toArray)
--- End diff --

or just test this case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19730#discussion_r151725673
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -1039,13 +1039,19 @@ case class Cast(child: Expression, dataType: 
DataType, timeZoneId: Option[String
   }
 }
"""
-}.mkString("\n")
+}
+val fieldsEvalCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == 
null) {
+  ctx.splitExpressions(fieldsEvalCode, "castStruct",
+("InternalRow", ctx.INPUT_ROW) :: (rowClass, result) :: 
("InternalRow", tmpRow) :: Nil)
--- End diff --

I mean, we don't need to pass in `ctx.INPUT_ROW` to the split functions.
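
i.e. something like the following sketch, with the argument list taken from 
the diff above minus the input row:

```scala
// The split cast functions only read `result` and `tmpRow`, so the input row
// does not need to be threaded through as a parameter.
val fieldsEvalCodes = ctx.splitExpressions(fieldsEvalCode, "castStruct",
  (rowClass, result) :: ("InternalRow", tmpRow) :: Nil)
```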


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-11-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19257
  
Thank you for the decision, @cloud-fan. It's great to see the progress on 
this!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...

2017-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19730#discussion_r151723430
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -827,4 +827,34 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
 checkEvaluation(cast(Literal.create(input, from), to), input)
   }
+
+  test("SPARK-22500: cast for struct should not generate codes beyond 
64KB") {
+val N = 1000
+
+val from1 = new StructType(
+  (1 to N).map(i => StructField(s"s$i", StringType)).toArray)
+val to1 = new StructType(
+  (1 to N).map(i => StructField(s"i$i", IntegerType)).toArray)
+val input1 = Row.fromSeq((1 to N).map(i => i.toString))
+val output1 = Row.fromSeq((1 to N))
+checkEvaluation(cast(Literal.create(input1, from1), to1), output1)
+
+val from2 = new StructType(
+  (1 to N).map(i => StructField(s"a$i", ArrayType(StringType, 
containsNull = false))).toArray)
--- End diff --

I'd expect something like
```
val from2 = new StructType(
  (1 to N).map(i => StructField(s"s$i", from1)).toArray)
```
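
Spelled out fully, the nested-struct case could look like this sketch, reusing 
`from1`, `to1`, `input1` and `output1` from the test above (the 
`to2`/`input2`/`output2` names are illustrative):

```scala
// Nest the wide struct one level deeper so the generated cast code for a
// struct of structs is also exercised against the 64KB method limit.
val from2 = new StructType(
  (1 to N).map(i => StructField(s"s$i", from1)).toArray)
val to2 = new StructType(
  (1 to N).map(i => StructField(s"i$i", to1)).toArray)
val input2 = Row.fromSeq((1 to N).map(_ => input1))
val output2 = Row.fromSeq((1 to N).map(_ => output1))
checkEvaluation(cast(Literal.create(input2, from2), to2), output2)
```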


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19747
  
**[Test build #83968 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83968/testReport)**
 for PR 19747 at commit 
[`e5c2cf3`](https://github.com/apache/spark/commit/e5c2cf369912583b273ed573e3be4fdc5b9fb78d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...

2017-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19772


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19747
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83969/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19747
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


