date:20161011

[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66758 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66758/consoleFull)**
 for PR 14959 at commit 
[`d692e71`](https://github.com/apache/spark/commit/d692e715e9c4da3a00f18c1693b60772492a80e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentM...

2016-10-11 Thread seyfe

Github user seyfe closed the pull request at:

https://github.com/apache/spark/pull/15425


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-11 Thread seyfe

Github user seyfe commented on the issue:

https://github.com/apache/spark/pull/15425
  
Closing it since it's merged into 2.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...

2016-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15382#discussion_r82878712
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -757,7 +758,10 @@ private[sql] class SQLConf extends Serializable with 
CatalystConf with Logging {
 
   def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)
 
-  def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString
+  def warehousePath: String = {
+val path = new Path(getConf(WAREHOUSE_PATH))
+FileSystem.get(path.toUri, new 
Configuration()).makeQualified(path).toString
--- End diff --

There I'm looking at, for example, 
https://github.com/avulanov/spark/blob/ea24b59fe83c37dbab27579141b5c63cccee138d/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L141
 in the test code.  In non-test code I think it's the same source you copied, 
in SessionCatalog, line 154. 

Although these code locations can also deal with a scheme vs no scheme, it 
seemed to be easier to deal with it upfront, where it's returned to the rest of 
the code from the conf object. I think it'll be the same code, same complexity 
either place.

The fact that `resolveURI` doesn't quite do what we want here suggests, I 
suppose, that lots of things in Spark aren't going to play well with a Windows 
path with spaces. See:

```
scala> resolveURI("file:///C:/My Programs/path")
res14: java.net.URI = file:/Users/srowen/file:/C:/My%20Programs/path

scala> resolveURI("/C:/My Programs/path")
res15: java.net.URI = file:/C:/My%20Programs/path

scala> resolveURI("C:/My Programs/path")
res16: java.net.URI = file:/Users/srowen/C:/My%20Programs/path

scala> resolveURI("/My Programs/path")
res17: java.net.URI = file:/My%20Programs/path
```

The second (possibly alternative absolute Windows path with space) and 
fourth examples (Linux path with space) happen to come out right. If we're 
willing to also accept here that it's just what's going to work and not work, 
then, at least there is a working syntax for all scenarios when using this 
method.

I believe there's an argument for further changing resolveURI to work with 
more variants here, but I'd have to even figure out first what is supposed to 
work and not work!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to run in th...

2016-10-11 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15338
  
Jenkins add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to run in th...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15338
  
**[Test build #66759 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66759/consoleFull)**
 for PR 15338 at commit 
[`cb89755`](https://github.com/apache/spark/commit/cb89755fd99ca52a926b89fedbc24fbca43dc0ad).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15436: [SPARK-17875] [BUILD] Remove unneeded direct dependence ...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66752/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14788
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66713/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-11 Thread hvanhovell

Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82728542
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: 
LogicalPlan): Dataset[T] = {
-new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+val encoder = implicitly[Encoder[T]]
+if (encoder.clsTag.runtimeClass == classOf[Row]) {
+  // We should use the encoder generated from the executed plan rather 
than the existing
+  // encoder for DataFrame because the types of columns can be varied 
due to widening types.
+  // See SPARK-17123. This is a bit hacky. Maybe we should find a 
better way to do this.
+  ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+} else {
+  new Dataset(sparkSession, logicalPlan, encoder)
+}
--- End diff --

Why would `except` fail? Only rows from the first dataset should be 
returned, or am I missing something?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14788
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropd...

2016-10-11 Thread viirya

GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/15427

[SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates

## What changes were proposed in this pull request?

Two issues regarding Dataset.dropduplicates:

1. Dataset.dropDuplicates should consider the columns with same column name

We find and get the first resolved attribute from output with the given 
column name in `Dataset.dropDuplicates`. When we have the more than one columns 
with the same name. Other columns are put into aggregation columns, instead of 
grouping columns.

2. Dataset.dropDuplicates should not change the output of child plan

We create new `Alias` with new exprId in `Dataset.dropDuplicates` now. 
However it causes problem when we want to select the columns as follows:

val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
// ds("_2") will cause analysis exception
ds.dropDuplicates("_1").select(ds("_1").as[String], 
ds("_2").as[Int])


Because the two issues are both related to `Dataset.dropduplicates` and the 
code changes are not big, so submitting them together as one PR.

## How was this patch tested?

Jenkins tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-dropduplicates

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15427


commit dd6405c003ea082b1c614f2efed4d1bcb2d6f5b9
Author: Liang-Chi Hsieh 
Date:   2016-10-11T06:08:44Z

Fix Dataset.dropduplicates.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66716/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread SaintBacchus

Github user SaintBacchus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82730696
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -138,13 +138,16 @@ private[spark] abstract class MapOutputTracker(conf: 
SparkConf) extends Logging
* and the second item is a sequence of (shuffle block id, 
shuffle block size) tuples
* describing the shuffle blocks that are stored at that block 
manager.
*/
-  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, 
endPartition: Int)
+  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, 
endPartition: Int,
+  mapid: Int = -1)
--- End diff --

It's better to use `Seq[Int]` to fetch many maps in one time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15360#discussion_r82730881
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with 
TestHiveSingleton with SQLTestUtils
 }
   }
 
-  test("generate column-level statistics and load them from hive 
metastore") {
+  test("test refreshing table stats of cached data source table by 
`ANALYZE TABLE` statement") {
+val tableName = "tbl"
+withTable(tableName) {
+  val tableIndent = TableIdentifier(tableName, Some("default"))
+  val catalog = 
spark.sessionState.catalog.asInstanceOf[HiveSessionCatalog]
+  sql(s"CREATE TABLE $tableName (key int) USING PARQUET")
+  sql(s"INSERT INTO $tableName SELECT 1")
+  sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+  // Table lookup will make the table cached.
+  catalog.lookupRelation(tableIndent)
+  val stats1 = 
catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation]
+.catalogTable.get.stats.get
+  assert(stats1.sizeInBytes > 0)
+  assert(stats1.rowCount.contains(1))
+
+  sql(s"INSERT INTO $tableName SELECT 2")
+  sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+  catalog.lookupRelation(tableIndent)
+  val stats2 = 
catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation]
+.catalogTable.get.stats.get
+  assert(stats2.sizeInBytes > stats1.sizeInBytes)
+  assert(stats2.rowCount.contains(2))
+}
+  }
+
+  private def dataAndColStats(): (DataFrame, Seq[(StructField, 
ColumnStat)]) = {
 import testImplicits._
 
 val intSeq = Seq(1, 2)
 val stringSeq = Seq("a", "bb")
+val binarySeq = Seq("a", "bb").map(_.getBytes)
 val booleanSeq = Seq(true, false)
-
 val data = intSeq.indices.map { i =>
-  (intSeq(i), stringSeq(i), booleanSeq(i))
+  (intSeq(i), stringSeq(i), binarySeq(i), booleanSeq(i))
 }
-val tableName = "table"
-withTable(tableName) {
-  val df = data.toDF("c1", "c2", "c3")
-  df.write.format("parquet").saveAsTable(tableName)
-  val expectedColStatsSeq = df.schema.map { f =>
-val colStat = f.dataType match {
-  case IntegerType =>
-ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, 
intSeq.distinct.length.toLong))
-  case StringType =>
-ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / 
stringSeq.length.toDouble,
-  stringSeq.map(_.length).max.toLong, 
stringSeq.distinct.length.toLong))
-  case BooleanType =>
-ColumnStat(InternalRow(0L, 
booleanSeq.count(_.equals(true)).toLong,
-  booleanSeq.count(_.equals(false)).toLong))
-}
-(f, colStat)
+val df = data.toDF("c1", "c2", "c3", "c4")
+val expectedColStatsSeq = df.schema.map { f =>
+  val colStat = f.dataType match {
+case IntegerType =>
+  ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, 
intSeq.distinct.length.toLong))
+case StringType =>
+  ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / 
stringSeq.length.toDouble,
+stringSeq.map(_.length).max.toLong, 
stringSeq.distinct.length.toLong))
+case BinaryType =>
+  ColumnStat(InternalRow(0L, binarySeq.map(_.length).sum / 
binarySeq.length.toDouble,
+binarySeq.map(_.length).max.toLong))
+case BooleanType =>
+  ColumnStat(InternalRow(0L, 
booleanSeq.count(_.equals(true)).toLong,
+booleanSeq.count(_.equals(false)).toLong))
   }
+  (f, colStat)
+}
+(df, expectedColStatsSeq)
+  }
+
+  private def checkColStats(
+  tableName: String,
+  isDataSourceTable: Boolean,
+  expectedColStatsSeq: Seq[(StructField, ColumnStat)]): Unit = {
+val readback = spark.table(tableName)
+val stats = readback.queryExecution.analyzed.collect {
+  case rel: MetastoreRelation =>
+assert(!isDataSourceTable, "Expected a Hive serde table, but got a 
data source table")
+rel.catalogTable.stats.get
+  case rel: LogicalRelation =>
+assert(isDataSourceTable, "Expected a data source table, but got a 
Hive serde table")
+rel.catalogTable.get.stats.get
+}
+assert(stats.length == 1)
+val columnStats = stats.head.colStats
+assert(columnStats.size == expectedColStatsSeq.length)
+expectedColStatsSeq.foreach { case (field, expectedColStat) =>
+  StatisticsTest.checkColStat(
+dataType = field.dataType,

[GitHub] spark pull request #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyroli...

2016-10-11 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15386


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15408
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15408
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66714/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread SaintBacchus

Github user SaintBacchus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82728585
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SkewShuffleRowRDD.scala 
---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import java.util.Arrays
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+
+class SkewCoalescedPartitioner(
+val parent: Partitioner,
--- End diff --

Nit: code format


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14702
  
**[Test build #66723 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66723/consoleFull)**
 for PR 14702 at commit 
[`c7741f9`](https://github.com/apache/spark/commit/c7741f9560e619c1b3c2a30750e1386963cf5ece).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15295
  
**[Test build #66722 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66722/consoleFull)**
 for PR 15295 at commit 
[`0ad8815`](https://github.com/apache/spark/commit/0ad8815b9042aadefa506e1e106822aa4bee810f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15427
  
**[Test build #66724 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66724/consoleFull)**
 for PR 15427 at commit 
[`dd6405c`](https://github.com/apache/spark/commit/dd6405c003ea082b1c614f2efed4d1bcb2d6f5b9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15360#discussion_r82730527
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with 
TestHiveSingleton with SQLTestUtils
 }
   }
 
-  test("generate column-level statistics and load them from hive 
metastore") {
+  test("test refreshing table stats of cached data source table by 
`ANALYZE TABLE` statement") {
+val tableName = "tbl"
+withTable(tableName) {
+  val tableIndent = TableIdentifier(tableName, Some("default"))
+  val catalog = 
spark.sessionState.catalog.asInstanceOf[HiveSessionCatalog]
+  sql(s"CREATE TABLE $tableName (key int) USING PARQUET")
+  sql(s"INSERT INTO $tableName SELECT 1")
+  sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+  // Table lookup will make the table cached.
+  catalog.lookupRelation(tableIndent)
+  val stats1 = 
catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation]
+.catalogTable.get.stats.get
+  assert(stats1.sizeInBytes > 0)
+  assert(stats1.rowCount.contains(1))
+
+  sql(s"INSERT INTO $tableName SELECT 2")
+  sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+  catalog.lookupRelation(tableIndent)
+  val stats2 = 
catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation]
+.catalogTable.get.stats.get
+  assert(stats2.sizeInBytes > stats1.sizeInBytes)
+  assert(stats2.rowCount.contains(2))
+}
+  }
+
+  private def dataAndColStats(): (DataFrame, Seq[(StructField, 
ColumnStat)]) = {
 import testImplicits._
 
 val intSeq = Seq(1, 2)
 val stringSeq = Seq("a", "bb")
+val binarySeq = Seq("a", "bb").map(_.getBytes)
 val booleanSeq = Seq(true, false)
-
 val data = intSeq.indices.map { i =>
-  (intSeq(i), stringSeq(i), booleanSeq(i))
+  (intSeq(i), stringSeq(i), binarySeq(i), booleanSeq(i))
 }
-val tableName = "table"
-withTable(tableName) {
-  val df = data.toDF("c1", "c2", "c3")
-  df.write.format("parquet").saveAsTable(tableName)
-  val expectedColStatsSeq = df.schema.map { f =>
-val colStat = f.dataType match {
-  case IntegerType =>
-ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, 
intSeq.distinct.length.toLong))
-  case StringType =>
-ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / 
stringSeq.length.toDouble,
-  stringSeq.map(_.length).max.toLong, 
stringSeq.distinct.length.toLong))
-  case BooleanType =>
-ColumnStat(InternalRow(0L, 
booleanSeq.count(_.equals(true)).toLong,
-  booleanSeq.count(_.equals(false)).toLong))
-}
-(f, colStat)
+val df = data.toDF("c1", "c2", "c3", "c4")
+val expectedColStatsSeq = df.schema.map { f =>
+  val colStat = f.dataType match {
+case IntegerType =>
+  ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, 
intSeq.distinct.length.toLong))
+case StringType =>
+  ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / 
stringSeq.length.toDouble,
+stringSeq.map(_.length).max.toLong, 
stringSeq.distinct.length.toLong))
+case BinaryType =>
+  ColumnStat(InternalRow(0L, binarySeq.map(_.length).sum / 
binarySeq.length.toDouble,
+binarySeq.map(_.length).max.toLong))
+case BooleanType =>
+  ColumnStat(InternalRow(0L, 
booleanSeq.count(_.equals(true)).toLong,
+booleanSeq.count(_.equals(false)).toLong))
   }
+  (f, colStat)
+}
+(df, expectedColStatsSeq)
+  }
+
+  private def checkColStats(
+  tableName: String,
+  isDataSourceTable: Boolean,
+  expectedColStatsSeq: Seq[(StructField, ColumnStat)]): Unit = {
+val readback = spark.table(tableName)
+val stats = readback.queryExecution.analyzed.collect {
+  case rel: MetastoreRelation =>
+assert(!isDataSourceTable, "Expected a Hive serde table, but got a 
data source table")
+rel.catalogTable.stats.get
+  case rel: LogicalRelation =>
+assert(isDataSourceTable, "Expected a data source table, but got a 
Hive serde table")
+rel.catalogTable.get.stats.get
+}
+assert(stats.length == 1)
+val columnStats = stats.head.colStats
+assert(columnStats.size == expectedColStatsSeq.length)
+expectedColStatsSeq.foreach { case (field, expectedColStat) =>
+  StatisticsTest.checkColStat(
+dataType = field.dataType,

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66725 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66725/consoleFull)**
 for PR 15285 at commit 
[`7cc6935`](https://github.com/apache/spark/commit/7cc6935cfade55a54d866bf2431bb28ef2f2544a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15377
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15377
  
**[Test build #66715 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66715/consoleFull)**
 for PR 15377 at commit 
[`7485ffa`](https://github.com/apache/spark/commit/7485ffaa3df508f35df4b878ed715eb1ece0f4db).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15360#discussion_r82731661
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
   val statistics = Statistics(
 sizeInBytes = newTotalSize,
 rowCount = Some(rowCount),
-colStats = columnStats ++ 
catalogTable.stats.map(_.colStats).getOrElse(Map()))
+colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ 
columnStats)
--- End diff --

Could you leave a code comment here to emphasize it?  I am just afraid this 
might be modified without notice. Newly computed stats should override the 
existing stats.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-11 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82732276
  
--- Diff: 
core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link 
java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ */
+public final class NioBasedBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBasedBufferedFileInputStream(File file, int bufferSizeInBytes) 
throws IOException {
+byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+byteBuffer.flip();
+  }
+
+  public NioBasedBufferedFileInputStream(File file) throws IOException {
+this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+if (!byteBuffer.hasRemaining()) {
+  byteBuffer.clear();
+  int nRead = fileChannel.read(byteBuffer);
+  if (nRead == -1) {
--- End diff --

It is possible to read 0 bytes and then next time able to read more than 0 
bytes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15324: [SPARK-16872][ML] Gaussian Naive Bayes Classifier

2016-10-11 Thread zhengruifeng

Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/15324
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-11 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82730295
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: 
LogicalPlan): Dataset[T] = {
-new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+val encoder = implicitly[Encoder[T]]
+if (encoder.clsTag.runtimeClass == classOf[Row]) {
+  // We should use the encoder generated from the executed plan rather 
than the existing
+  // encoder for DataFrame because the types of columns can be varied 
due to widening types.
+  // See SPARK-17123. This is a bit hacky. Maybe we should find a 
better way to do this.
+  ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+} else {
+  new Dataset(sparkSession, logicalPlan, encoder)
+}
--- End diff --

Ah, here is the codes I ran

```scala
val dates = Seq(
  (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)),
  (new Date(3), BigDecimal.valueOf(4), new Timestamp(5))
).toDF("date", "timestamp", "decimal")

val widenTypedRows = Seq(
  (new Timestamp(2), 10.5D, "string")
).toDF("date", "timestamp", "decimal")

dates.except(widenTypedRows).collect()
```

and error message.

```java
23:10:05.331 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 30, Column 107: No applicable constructor/method found for actual 
parameters "long"; candidates are: "public static java.sql.Date 
org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private Object[] values;
/* 010 */   private org.apache.spark.sql.types.StructType schema;
/* 011 */
/* 012 */   public SpecificSafeProjection(Object[] references) {
/* 013 */ this.references = references;
/* 014 */ mutableRow = (InternalRow) references[references.length - 1];
/* 015 */
/* 016 */ this.schema = (org.apache.spark.sql.types.StructType) 
references[0];
/* 017 */
/* 018 */   }
/* 019 */
/* 020 */
/* 021 */
/* 022 */   public java.lang.Object apply(java.lang.Object _i) {
/* 023 */ InternalRow i = (InternalRow) _i;
/* 024 */
/* 025 */ values = new Object[3];
/* 026 */
/* 027 */ boolean isNull2 = i.isNullAt(0);
/* 028 */ long value2 = isNull2 ? -1L : (i.getLong(0));
/* 029 */ boolean isNull1 = isNull2;
/* 030 */ final java.sql.Date value1 = isNull1 ? null : 
org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
/* 031 */ isNull1 = value1 == null;
/* 032 */ if (isNull1) {
/* 033 */   values[0] = null;
/* 034 */ } else {
/* 035 */   values[0] = value1;
/* 036 */ }
/* 037 */
/* 038 */ boolean isNull4 = i.isNullAt(1);
/* 039 */ double value4 = isNull4 ? -1.0 : (i.getDouble(1));
/* 040 */
/* 041 */ boolean isNull3 = isNull4;
/* 042 */ java.math.BigDecimal value3 = null;
/* 043 */ if (!isNull3) {
/* 044 */
/* 045 */   Object funcResult = null;
/* 046 */   funcResult = value4.toJavaBigDecimal();
/* 047 */   if (funcResult == null) {
/* 048 */ isNull3 = true;
/* 049 */   } else {
/* 050 */ value3 = (java.math.BigDecimal) funcResult;
/* 051 */   }
/* 052 */
/* 053 */ }
/* 054 */ isNull3 = value3 == null;
/* 055 */ if (isNull3) {
/* 056 */   values[1] = null;
/* 057 */ } else {
/* 058 */   values[1] = value3;
/* 059 */ }
/* 060 */
/* 061 */ boolean isNull6 = i.isNullAt(2);
/* 062 */ UTF8String value6 = isNull6 ? null : (i.getUTF8String(2));
/* 063 */ boolean isNull5 = isNull6;
/* 064 */ final java.sql.Timestamp value5 = isNull5 ? null : 
org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaTimestamp(value6);
/* 065 */ isNull5 = value5 == null;
/* 066 */ if (isNull5) {
/* 067 */   values[2] = null;
/* 068 */ } else {
/* 069 */   values[2] = value5;
/* 070 */ }
/* 071 */
/* 072 */ final org.apache.spark.sql.Row value = new

[GitHub] spark issue #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4...

2016-10-11 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15386
  
Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15360#discussion_r82731979
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with 
TestHiveSingleton with SQLTestUtils
 }
   }
 
-  test("generate column-level statistics and load them from hive 
metastore") {
+  test("test refreshing table stats of cached data source table by 
`ANALYZE TABLE` statement") {
--- End diff --

Could you deduplicate the two test cases `refreshing table stats` and 
`refreshing column stats` by calling the same common function?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-11 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82732392
  
--- Diff: 
core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link 
java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ * TODO: support {@link #mark(int)}/{@link #reset()}
+ *
+ */
+public final class NioBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBufferedFileInputStream(File file, int bufferSizeInBytes) 
throws IOException {
+byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+byteBuffer.flip();
+  }
+
+  public NioBufferedFileInputStream(File file) throws IOException {
+this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+if (!byteBuffer.hasRemaining()) {
+  byteBuffer.clear();
+  int nRead = fileChannel.read(byteBuffer);
+  if (nRead <= 0) {
--- End diff --

Should only return false if nRead is -1, which is the contract specified by 
FileChannel.read().

We should make sure the thing still works if nRead is 0.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66718/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66718 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66718/consoleFull)**
 for PR 15285 at commit 
[`e5676a6`](https://github.com/apache/spark/commit/e5676a6d4e60e7b7446bf525fb7003cb26efc448).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15377
  
**[Test build #66719 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66719/consoleFull)**
 for PR 15377 at commit 
[`df28bdd`](https://github.com/apache/spark/commit/df28bdddce5e4789a02cf7ef5dedab8b7c408630).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82733349
  
--- Diff: 
core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link 
java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ * TODO: support {@link #mark(int)}/{@link #reset()}
+ *
+ */
+public final class NioBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBufferedFileInputStream(File file, int bufferSizeInBytes) 
throws IOException {
+byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+byteBuffer.flip();
+  }
+
+  public NioBufferedFileInputStream(File file) throws IOException {
+this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+if (!byteBuffer.hasRemaining()) {
+  byteBuffer.clear();
+  int nRead = fileChannel.read(byteBuffer);
+  if (nRead <= 0) {
--- End diff --

The thing is, that will just result in an error on the next read(), because 
nothing was put in the buffer. Is that worse? I'm not even sure. But, calling 
it in a loop could result in an infinite loop. I'm accustomed to stopping 
reading when the result is 0 in cases like this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15285
  
**[Test build #66716 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66716/consoleFull)**
 for PR 15285 at commit 
[`ef4f2b9`](https://github.com/apache/spark/commit/ef4f2b9dc1be33d56d7d4c93bddcfcc2a69a44e9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11459
  
**[Test build #66726 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66726/consoleFull)**
 for PR 11459 at commit 
[`ab05aa6`](https://github.com/apache/spark/commit/ab05aa61b21a23531e43d082757a40bf2ab750d8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15377
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66715/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15408
  
**[Test build #66714 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66714/consoleFull)**
 for PR 15408 at commit 
[`681ff62`](https://github.com/apache/spark/commit/681ff62409e1f6520057bdeafd991e2c12a0b232).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15285
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15377
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15377
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66719/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread felixcheung

Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15421
  
This fails on AppVeyor - any idea?
```
. Error: SPARK-17811: can create DataFrame containing NA as date and time 
(@test_sparkSQL.R#388) 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
105.0 (TID 109, localhost): java.lang.NegativeArraySizeException
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15230: [SPARK-17657] [SQL] Disallow Users to Change Table Type

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15230
  
retest this please



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...

2016-10-11 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15421#discussion_r82733869
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -11,7 +11,8 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = 
c("aut", "cre"),
 email = "felixche...@apache.org"),
  person(family = "The Apache Software Foundation", role = 
c("aut", "cph")))
 URL: http://www.apache.org/ http://spark.apache.org/
-BugReports: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports
+BugReports: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to
++Spark#ContributingtoSpark-ContributingBugReports
--- End diff --

Do we know if this works if broken up into 2 lines?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15324: [SPARK-16872][ML] Gaussian Naive Bayes Classifier

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15324
  
**[Test build #66727 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66727/consoleFull)**
 for PR 15324 at commit 
[`c6ad342`](https://github.com/apache/spark/commit/c6ad34258053a7df29e062ec6f1ab05a3b30302a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15230: [SPARK-17657] [SQL] Disallow Users to Change Table Type

2016-10-11 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15230
  
cc @cloud-fan Could you review it again? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...

2016-10-11 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15342#discussion_r82735096
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -499,13 +414,38 @@ object KMeans {
* @param data Training points as an `RDD` of `Vector` types.
* @param k Number of clusters to create.
* @param maxIterations Maximum number of iterations allowed.
+   * @param initializationMode The initialization algorithm. This can 
either be "random" or
+   *   "k-means||". (default: "k-means||")
+   * @param seed Random seed for cluster initialization. Default is to 
generate seed based
+   * on system time.
+   */
+  @Since("2.1.0")
+  def train(data: RDD[Vector],
--- End diff --

+1 @sethah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14788#discussion_r82735390
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2374,14 +2374,14 @@ object functions {
* @group datetime_funcs
* @since 1.5.0
*/
-  def date_add(start: Column, days: Int): Column = withExpr { 
DateAdd(start.expr, Literal(days)) }
+  def date_add(start: Column, days: Int): Column = withExpr { 
AddDays(start.expr, Literal(days)) }
--- End diff --

It's confusing to users that `date_add` add days to the given date and 
`add_months` add months to the given date. I think `add_days` and `add_months` 
are more consistent.

Other databases(e.g. MySQL, Postgres) only have `date_add` which adds 
interval to the given date, so that they don't need `add_days` and `add_months` 
respectively.

The function name is already realsed and maybe hard to change, but changing 
the expression name to match the really logic seems good.

@rxin any thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15424
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66720/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread rxin

Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15426
  
cc @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15424: [SPARK-17338][SQL][follow-up] add global temp vie...

2016-10-11 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15424


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15426
  
merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-11 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82737475
  
--- Diff: 
core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link 
java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ * TODO: support {@link #mark(int)}/{@link #reset()}
+ *
+ */
+public final class NioBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBufferedFileInputStream(File file, int bufferSizeInBytes) 
throws IOException {
+byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+byteBuffer.flip();
+  }
+
+  public NioBufferedFileInputStream(File file) throws IOException {
+this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+if (!byteBuffer.hasRemaining()) {
+  byteBuffer.clear();
+  int nRead = fileChannel.read(byteBuffer);
+  if (nRead <= 0) {
--- End diff --

It is pretty standard to rely only in -1. It is especially common in 
networking code to get 0 bytes read.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11459
  
**[Test build #66726 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66726/consoleFull)**
 for PR 11459 at commit 
[`ab05aa6`](https://github.com/apache/spark/commit/ab05aa61b21a23531e43d082757a40bf2ab750d8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15426: [SPARK-17864][SQL] Mark data type APIs as stable ...

2016-10-11 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15426


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-11 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82737858
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: 
LogicalPlan): Dataset[T] = {
-new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+val encoder = implicitly[Encoder[T]]
+if (encoder.clsTag.runtimeClass == classOf[Row]) {
+  // We should use the encoder generated from the executed plan rather 
than the existing
+  // encoder for DataFrame because the types of columns can be varied 
due to widening types.
+  // See SPARK-17123. This is a bit hacky. Maybe we should find a 
better way to do this.
+  ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+} else {
+  new Dataset(sparkSession, logicalPlan, encoder)
+}
--- End diff --

Ah-ha, it seems we are widening types for set operators[1].

```scala
val dates = Seq(
  (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)),
  (new Date(3), BigDecimal.valueOf(4), new Timestamp(5))
).toDF("date", "timestamp", "decimal")

val widenTypedRows = Seq(
  (new Timestamp(2), 10.5D, "string")
).toDF("date", "timestamp", "decimal")

dates.printSchema()
dates.except(widenTypedRows).toDF().printSchema()
```

prints

```
root
 |-- date: date (nullable = true)
 |-- timestamp: decimal(38,18) (nullable = true)
 |-- decimal: timestamp (nullable = true)

root
 |-- date: timestamp (nullable = true)
 |-- timestamp: double (nullable = true)
 |-- decimal: string (nullable = true)
```


[1]https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L232-L249


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14847: [SPARK-17254][SQL] Filter can stop when the condition is...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14847
  
**[Test build #66730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66730/consoleFull)**
 for PR 14847 at commit 
[`ccac04f`](https://github.com/apache/spark/commit/ccac04f1e788dfb278618a10ee9220c89df6a61d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-11 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82738407
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: 
LogicalPlan): Dataset[T] = {
-new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+val encoder = implicitly[Encoder[T]]
+if (encoder.clsTag.runtimeClass == classOf[Row]) {
+  // We should use the encoder generated from the executed plan rather 
than the existing
+  // encoder for DataFrame because the types of columns can be varied 
due to widening types.
+  // See SPARK-17123. This is a bit hacky. Maybe we should find a 
better way to do this.
+  ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+} else {
+  new Dataset(sparkSession, logicalPlan, encoder)
+}
--- End diff --

Ah, the reason why `intersect` was fine seems because it does contains any 
data. `intersect` seems also failed:

```
val dates = Seq(Tuple1(BigDecimal.valueOf(10.5))).toDF("decimal")
val widenTypedRows =  Seq(Tuple1(10.5D)).toDF("decimal")
dates.intersect(widenTypedRows).collect()
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementation / r...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15342
  
**[Test build #66729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66729/consoleFull)**
 for PR 15342 at commit 
[`ba52582`](https://github.com/apache/spark/commit/ba52582a1313be9d9febe215fe6f21b2d9be239f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15428
  
**[Test build #66731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66731/consoleFull)**
 for PR 15428 at commit 
[`a3e4308`](https://github.com/apache/spark/commit/a3e43086dcf6ecee20461567e2cc506db29f80a7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15429: [SPARK-17840] [DOCS] Add some pointers for wiki/C...

2016-10-11 Thread srowen

GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/15429

[SPARK-17840] [DOCS] Add some pointers for wiki/CONTRIBUTING.md in 
README.md and some warnings in PULL_REQUEST_TEMPLATE

## What changes were proposed in this pull request?

Link to contributing wiki in PR template, README.md

## How was this patch tested?

Doc-only change, tested by Jekyll

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-17840

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15429.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15429


commit 29941b9880719d410ba826f173739abd2091b463
Author: Sean Owen 
Date:   2016-10-11T07:51:16Z

Link to contributing wiki in PR template, README.md




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15428
  
**[Test build #66733 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66733/consoleFull)**
 for PR 15428 at commit 
[`cd8113c`](https://github.com/apache/spark/commit/cd8113c456870dd89754d0d48bfe6d0931ad05bb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15429: [SPARK-17840] [DOCS] Add some pointers for wiki/CONTRIBU...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15429
  
**[Test build #66732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66732/consoleFull)**
 for PR 15429 at commit 
[`29941b9`](https://github.com/apache/spark/commit/29941b9880719d410ba826f173739abd2091b463).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15388
  
thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...

2016-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r82741770
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -128,8 +145,9 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @throws SparkException if a feature is < splits.head or > splits.last
*/
-  private[feature] def binarySearchForBuckets(splits: Array[Double], 
feature: Double): Double = {
-if (feature.isNaN) {
+  private[feature] def binarySearchForBuckets
+  (splits: Array[Double], feature: Double, flag: String): Double = {
--- End diff --

Nit: I think the convention is to leave the open paren on the previous line

Doesn't this need to handle "skip" and "error"? throw an exception on NaN 
if "error" or ignore it if "skip"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...

2016-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15428#discussion_r82741203
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
@@ -270,10 +270,10 @@ private[ml] trait HasFitIntercept extends Params {
 private[ml] trait HasHandleInvalid extends Params {
 
   /**
-   * Param for how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error). More 
options may be added later.
+   * Param for how to handle invalid entries. Options are skip (which will 
filter out rows with bad values), or error (which will throw an error), or keep 
(which will keep the bad values in certain way). More options may be added 
later.
--- End diff --

I'm neutral on the complexity that this adds, but not against it. It gets a 
little funny to say "keep invalid data" but I think we discussed that on the 
JIRA


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread witgo

Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82742056
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -138,13 +138,16 @@ private[spark] abstract class MapOutputTracker(conf: 
SparkConf) extends Logging
* and the second item is a sequence of (shuffle block id, 
shuffle block size) tuples
* describing the shuffle blocks that are stored at that block 
manager.
*/
-  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, 
endPartition: Int)
+  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, 
endPartition: Int,
+  mapid: Int = -1)
--- End diff --

`mapid` => `mapId`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15388: [SPARK-17821][SQL] Support And and Or in Expressi...

2016-10-11 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15388


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread witgo

Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82743168
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala 
---
@@ -48,7 +48,8 @@ private[spark] trait ShuffleManager {
   handle: ShuffleHandle,
   startPartition: Int,
   endPartition: Int,
-  context: TaskContext): ShuffleReader[K, C]
+  context: TaskContext,
+  mapid: Int = -1): ShuffleReader[K, C]
--- End diff --

`mapid` => `mapId`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15405: [SPARK-15917][CORE] Added support for number of executor...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15405
  
**[Test build #3323 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3323/consoleFull)**
 for PR 15405 at commit 
[`bffedac`](https://github.com/apache/spark/commit/bffedac0756c98861f44dfa0967ec8477c63c4cc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15423#discussion_r82744230
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -168,17 +168,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
* }}}
*/
   override def visitShowColumns(ctx: ShowColumnsContext): LogicalPlan = 
withOrigin(ctx) {
-val table = visitTableIdentifier(ctx.tableIdentifier)
-
-val lookupTable = Option(ctx.db) match {
-  case None => table
-  case Some(db) if table.database.exists(_ != db) =>
-operationNotAllowed(
-  s"SHOW COLUMNS with conflicting databases: '$db' != 
'${table.database.get}'",
-  ctx)
-  case Some(db) => TableIdentifier(table.identifier, Some(db.getText))
-}
-ShowColumnsCommand(lookupTable)
+ShowColumnsCommand(Option(ctx.db).map(_.getText), 
visitTableIdentifier(ctx.tableIdentifier))
--- End diff --

FYI, MySQL will treat `SHOW COLUMNS FROM db1.tbl1 FROM db2` as `SHOW 
COLUMNS FROM tbl1 FROM db2`, i.e. if `FROM database` is specified, it will just 
ignore the database specified in table name, instead of reporting error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15430: [SPARK-15957][Follow-up][ML][PySpark] Add Python ...

2016-10-11 Thread yanboliang

GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15430

[SPARK-15957][Follow-up][ML][PySpark] Add Python API for RFormula 
forceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-15957-python

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15430.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15430


commit e34610c216c45d9500423e8c49425ca5a327c52e
Author: Yanbo Liang 
Date:   2016-10-11T08:17:11Z

Add Python API for RFormula forceIndexLabel.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick

Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/9
  
I misread DB's meaning in my previous comment.

I agree that the parameter settings of `initialModel`, if set, should take 
precedence. If it conflicts with an existing `k` then log a warning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...

2016-10-11 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15423#discussion_r82745403
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -168,17 +168,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
* }}}
*/
   override def visitShowColumns(ctx: ShowColumnsContext): LogicalPlan = 
withOrigin(ctx) {
-val table = visitTableIdentifier(ctx.tableIdentifier)
-
-val lookupTable = Option(ctx.db) match {
-  case None => table
-  case Some(db) if table.database.exists(_ != db) =>
-operationNotAllowed(
-  s"SHOW COLUMNS with conflicting databases: '$db' != 
'${table.database.get}'",
-  ctx)
-  case Some(db) => TableIdentifier(table.identifier, Some(db.getText))
-}
-ShowColumnsCommand(lookupTable)
+ShowColumnsCommand(Option(ctx.db).map(_.getText), 
visitTableIdentifier(ctx.tableIdentifier))
--- End diff --

Good  point!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15427
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66724/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15427
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15428
  
**[Test build #66735 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66735/consoleFull)**
 for PR 15428 at commit 
[`5cd58b7`](https://github.com/apache/spark/commit/5cd58b776883d5185f2b050317d260b924ebef5e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...

2016-10-11 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15342#discussion_r82734529
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -258,149 +252,106 @@ class KMeans private (
 }
 }
 val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
-logInfo(s"Initialization with $initializationMode took " + 
"%.3f".format(initTimeInSeconds) +
-  " seconds.")
+logInfo(f"Initialization with $initializationMode took 
$initTimeInSeconds%.3f seconds.")
 
-val active = Array.fill(numRuns)(true)
-val costs = Array.fill(numRuns)(0.0)
-
-var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
+var converged = false
+var cost = 0.0
 var iteration = 0
 
 val iterationStartTime = System.nanoTime()
 
-instr.foreach(_.logNumFeatures(centers(0)(0).vector.size))
+instr.foreach(_.logNumFeatures(centers.head.vector.size))
 
-// Execute iterations of Lloyd's algorithm until all runs have 
converged
-while (iteration < maxIterations && !activeRuns.isEmpty) {
-  type WeightedPoint = (Vector, Long)
-  def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint 
= {
-axpy(1.0, x._1, y._1)
-(y._1, x._2 + y._2)
-  }
-
-  val activeCenters = activeRuns.map(r => centers(r)).toArray
-  val costAccums = activeRuns.map(_ => sc.doubleAccumulator)
-
-  val bcActiveCenters = sc.broadcast(activeCenters)
+// Execute iterations of Lloyd's algorithm until converged
+while (iteration < maxIterations && !converged) {
+  val costAccum = sc.doubleAccumulator
+  val bcCenters = sc.broadcast(centers)
 
   // Find the sum and count of points mapping to each center
   val totalContribs = data.mapPartitions { points =>
-val thisActiveCenters = bcActiveCenters.value
-val runs = thisActiveCenters.length
-val k = thisActiveCenters(0).length
-val dims = thisActiveCenters(0)(0).vector.size
+val thisCenters = bcCenters.value
+val dims = thisCenters.head.vector.size
 
-val sums = Array.fill(runs, k)(Vectors.zeros(dims))
-val counts = Array.fill(runs, k)(0L)
+val sums = Array.fill(thisCenters.length)(Vectors.zeros(dims))
+val counts = Array.fill(thisCenters.length)(0L)
 
 points.foreach { point =>
-  (0 until runs).foreach { i =>
-val (bestCenter, cost) = 
KMeans.findClosest(thisActiveCenters(i), point)
-costAccums(i).add(cost)
-val sum = sums(i)(bestCenter)
-axpy(1.0, point.vector, sum)
-counts(i)(bestCenter) += 1
-  }
+  val (bestCenter, cost) = KMeans.findClosest(thisCenters, point)
+  costAccum.add(cost)
+  val sum = sums(bestCenter)
+  axpy(1.0, point.vector, sum)
+  counts(bestCenter) += 1
 }
 
-val contribs = for (i <- 0 until runs; j <- 0 until k) yield {
-  ((i, j), (sums(i)(j), counts(i)(j)))
-}
-contribs.iterator
-  }.reduceByKey(mergeContribs).collectAsMap()
-
-  bcActiveCenters.destroy(blocking = false)
-
-  // Update the cluster centers and costs for each active run
-  for ((run, i) <- activeRuns.zipWithIndex) {
-var changed = false
-var j = 0
-while (j < k) {
-  val (sum, count) = totalContribs((i, j))
-  if (count != 0) {
-scal(1.0 / count, sum)
-val newCenter = new VectorWithNorm(sum)
-if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > 
epsilon * epsilon) {
-  changed = true
-}
-centers(run)(j) = newCenter
-  }
-  j += 1
-}
-if (!changed) {
-  active(run) = false
-  logInfo("Run " + run + " finished in " + (iteration + 1) + " 
iterations")
+counts.indices.filter(counts(_) > 0).map(j => (j, (sums(j), 
counts(j.iterator
+  }.reduceByKey { case ((sum1, count1), (sum2, count2)) =>
+axpy(1.0, sum2, sum1)
+(sum1, count1 + count2)
+  }.collectAsMap()
+
+  bcCenters.destroy(blocking = false)
+
+  // Update the cluster centers and costs
+  converged = true
+  totalContribs.foreach { case (j, (sum, count)) =>
+scal(1.0 / count, sum)
+val newCenter = new VectorWithNorm(sum)
+if (converged && KMeans.fastSquaredDistance(newCenter, centers(j)) 
> epsilon * epsilon) {
+  converged =

[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15424
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15426
  
**[Test build #66721 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66721/consoleFull)**
 for PR 15426 at commit 
[`0cf7e72`](https://github.com/apache/spark/commit/0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15426
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66721/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15424
  
**[Test build #66720 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66720/consoleFull)**
 for PR 15424 at commit 
[`0ff26d0`](https://github.com/apache/spark/commit/0ff26d0050b12917f0c801ba61d43d0ae4970f81).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15426
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...

2016-10-11 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15342#discussion_r82736118
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -531,6 +471,7 @@ object KMeans {
*   "k-means||". (default: "k-means||")
*/
   @Since("0.8.0")
+  @deprecated("Use train method without 'runs'", "2.1.0")
   def train(
--- End diff --

+1 @sethah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...

2016-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15408#discussion_r82737052
  
--- Diff: 
core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.io;
+
+import org.apache.spark.storage.StorageUtils;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+
+/**
+ * {@link InputStream} implementation which uses direct buffer
+ * to read a file to avoid extra copy of data between Java and
+ * native memory which happens when using {@link 
java.io.BufferedInputStream}.
+ * Unfortunately, this is not something already available in JDK,
+ * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
+ * but does not support buffering.
+ *
+ * TODO: support {@link #mark(int)}/{@link #reset()}
+ *
+ */
+public final class NioBufferedFileInputStream extends InputStream {
+
+  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
+
+  private final ByteBuffer byteBuffer;
+
+  private final FileChannel fileChannel;
+
+  public NioBufferedFileInputStream(File file, int bufferSizeInBytes) 
throws IOException {
+byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
+fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
+byteBuffer.flip();
+  }
+
+  public NioBufferedFileInputStream(File file) throws IOException {
+this(file, DEFAULT_BUFFER_SIZE_BYTES);
+  }
+
+  /**
+   * Checks weather data is left to be read from the input stream.
+   * @return true if data is left, false otherwise
+   * @throws IOException
+   */
+  private boolean refill() throws IOException {
+if (!byteBuffer.hasRemaining()) {
+  byteBuffer.clear();
+  int nRead = fileChannel.read(byteBuffer);
+  if (nRead <= 0) {
--- End diff --

It could be 0 forever though (dunno, FS error?) and then this creates an 
infinite loop. I went hunting through the SDK for some examples of dealing with 
this, because this has always been a tricky part of even `InputStream`.

`java.nio.Files`:

```
private static long copy(InputStream source, OutputStream sink)
throws IOException
{
long nread = 0L;
byte[] buf = new byte[BUFFER_SIZE];
int n;
while ((n = source.read(buf)) > 0) {
sink.write(buf, 0, n);
nread += n;
}
return nread;
}

...

private static byte[] read(InputStream source, int initialSize) throws 
IOException {
int capacity = initialSize;
byte[] buf = new byte[capacity];
int nread = 0;
int n;
for (;;) {
// read to EOF which may read more or less than initialSize 
(eg: file
// is truncated while we are reading)
while ((n = source.read(buf, nread, capacity - nread)) > 0)
nread += n;

// if last call to source.read() returned -1, we are done
// otherwise, try to read one more byte; if that failed we're 
done too
if (n < 0 || (n = source.read()) < 0)
break;

// one more byte was read; need to allocate a larger buffer
if (capacity <= MAX_BUFFER_SIZE - capacity) {
capacity = Math.max(capacity << 1, BUFFER_SIZE);
} else {
if (capacity == MAX_BUFFER_SIZE)
throw new OutOfMemoryError("Required array size too 
large");
capacity = MAX_BUFFER_SIZE;
}
buf = Arrays.copyOf(buf, capacity);
buf[nread++] = (byte)n;
}
return (capacity == nread) ? buf : Arrays.copyOf(buf, nread);
}
```

The first instance seems to assume that 0 bytes means it should stop.
In the second instance it forces the `InputStream` to give a definitive 
answer by

[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...

2016-10-11 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14788#discussion_r82737073
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2374,14 +2374,14 @@ object functions {
* @group datetime_funcs
* @since 1.5.0
*/
-  def date_add(start: Column, days: Int): Column = withExpr { 
DateAdd(start.expr, Literal(days)) }
+  def date_add(start: Column, days: Int): Column = withExpr { 
AddDays(start.expr, Literal(days)) }
--- End diff --

I don't get what you were suggesting here. Wouldn't it make more sense to 
make DateAdd expression support both adding Interval type and adding 
IntegralType (for days)?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14788
  
**[Test build #66728 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66728/consoleFull)**
 for PR 14788 at commit 
[`cd78330`](https://github.com/apache/spark/commit/cd783307fafa7987505906101e8e90148c589214).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15426
  
LGTM



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11459
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66726/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11459
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...

2016-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14702
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66723/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14702
  
**[Test build #66723 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66723/consoleFull)**
 for PR 14702 at commit 
[`c7741f9`](https://github.com/apache/spark/commit/c7741f9560e619c1b3c2a30750e1386963cf5ece).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...

2016-10-11 Thread VinceShieh

GitHub user VinceShieh opened a pull request:

https://github.com/apache/spark/pull/15428

[SPARK-17219][ML] enchanced NaN value handling in Bucketizer

## What changes were proposed in this pull request?

This PR is an enhancement of PR with commit 
ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We 
find that there are certain cases where NaN are also valuable, thus need 
special handling. We provided user when dealing NaN values with 3 options, to 
either reserve an extra bucket for NaN values, or remove the NaN values, or 
report an error, by passing "keep", "skip", or "error"(default) to 
setHandleInvalid.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleInvalid("keep")

## How was this patch tested?
Tests added in QuantileDiscretizerSuite and BucketizerSuite

Signed-off-by: VinceShieh 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VinceShieh/spark spark-17219_followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15428


commit a3e43086dcf6ecee20461567e2cc506db29f80a7
Author: VinceShieh 
Date:   2016-10-10T02:33:09Z

[SPARK-17219][ML] enchance NaN value handling in Bucketizer

This PR is an enhancement of PR with commit 
ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
We provided user when dealing NaN value in the dataset with 3 options, to 
either reserve an extra
bucket for NaN values, or remove the NaN values, or report an error, by 
passing "keep", "skip",
or "error"(default) to setHandleInvalid.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleInvalid("skip")

Signed-off-by: VinceShieh 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15428
  
**[Test build #66731 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66731/consoleFull)**
 for PR 15428 at commit 
[`a3e4308`](https://github.com/apache/spark/commit/a3e43086dcf6ecee20461567e2cc506db29f80a7).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15425
  
**[Test build #3321 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3321/consoleFull)**
 for PR 15425 at commit 
[`678ee6b`](https://github.com/apache/spark/commit/678ee6b1d6308a81a5c2d83a196144f29c80434b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14847: [SPARK-17254][SQL] Filter can stop when the condition is...

2016-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14847
  
@viirya can you try to create a new operator for this optimization and make 
it work with whole-stage-codegen? thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...

2016-10-11 Thread hvanhovell

Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15072#discussion_r82743571
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: 
LogicalPlan): Dataset[T] = {
-new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+val encoder = implicitly[Encoder[T]]
+if (encoder.clsTag.runtimeClass == classOf[Row]) {
+  // We should use the encoder generated from the executed plan rather 
than the existing
+  // encoder for DataFrame because the types of columns can be varied 
due to widening types.
+  // See SPARK-17123. This is a bit hacky. Maybe we should find a 
better way to do this.
+  ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+} else {
+  new Dataset(sparkSession, logicalPlan, encoder)
+}
--- End diff --

Yeah, I forgot about type widening.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

< 1 2 3 4 5 6 7 >

501 - 600 of 661 matches

Mail list logo