[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-12-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14452
  
Revisit this by rebasing with master.

BTW, of the 500+ LOC changed, 200+ LOC are test cases.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14452
  
**[Test build #70541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70541/testReport)** for PR 14452 at commit [`9faf90a`](https://github.com/apache/spark/commit/9faf90a346909b27aa7365bc42cd139c7d0fb3a7).





[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...

2016-12-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16232
  
ping @davies 





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13909
  
**[Test build #70540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70540/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).





[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70535/





[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15666
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70534/





[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15666
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16337
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15666
  
**[Test build #70534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70534/testReport)** for PR 15666 at commit [`73df5a4`](https://github.com/apache/spark/commit/73df5a4f5a961e558588b3462e7744a1c9c1266a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16337
  
**[Test build #70535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70535/testReport)** for PR 16337 at commit [`1c1900a`](https://github.com/apache/spark/commit/1c1900a261b12e95a8a53892017294df3c21b317).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13909
  
Jenkins, retest this please





[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/15211
  
I've sent a new update addressing most of the comments. The only exception is about `setWeightCol` in `LinearSVCModel`. cc @jkbradley.





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13909
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70537/





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13909
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15211
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15211
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70539/





[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15211
  
**[Test build #70539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70539/testReport)** for PR 15211 at commit [`21ecbf0`](https://github.com/apache/spark/commit/21ecbf08ed03f9b69f4cbec7380e547f146acec7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class LinearSVC @Since("2.2.0") (`





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13909
  
**[Test build #70537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70537/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r93733483
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -36,29 +31,31 @@ import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.SerializableConfiguration
 
+object JsonFileFormat {
+  def parseJsonOptions(sparkSession: SparkSession, options: Map[String, String]): JSONOptions = {
--- End diff --

I think I disagree with passing the whole `SparkSession`, because apparently we only need `SQLConf` or the value of `spark.sql.columnNameOfCorruptRecord`.
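
A minimal sketch of the narrower signature being suggested, assuming only that one value is needed; `JsonOpts` is a hypothetical stand-in for this PR's `JSONOptions`:

```scala
// Hypothetical stand-in for the PR's JSONOptions, for illustration only.
case class JsonOpts(parameters: Map[String, String], columnNameOfCorruptRecord: String)

object JsonFileFormatSketch {
  // Depend on just the conf value, not the whole SparkSession: the caller
  // reads spark.sql.columnNameOfCorruptRecord once and passes the string in.
  def parseJsonOptions(
      options: Map[String, String],
      columnNameOfCorruptRecord: String): JsonOpts =
    JsonOpts(options, columnNameOfCorruptRecord)
}
```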





[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15212
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15212
  
**[Test build #70536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70536/testReport)** for PR 15212 at commit [`5a7cc2c`](https://github.com/apache/spark/commit/5a7cc2ca9e81ade4d430411ab6e314ae5010169f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15212
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70536/





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r93732800
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.json
+
+import java.io.InputStream
+
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io.{LongWritable, Text}
+import org.apache.hadoop.mapreduce.Job
+import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.input.{PortableDataStream, StreamInputFormat}
+import org.apache.spark.rdd.{BinaryFileRDD, RDD}
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.json.{CreateJacksonParser, JacksonParser, JSONOptions}
+import org.apache.spark.sql.execution.datasources.{CodecStreams, HadoopFileLinesReader, PartitionedFile}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Common functions for parsing JSON files
+ * @tparam T A datatype containing the unparsed JSON, such as [[Text]] or [[String]]
+ */
+abstract class JsonDataSource[T] extends Serializable {
+  def isSplitable: Boolean
+
+  /**
+   * Parse a [[PartitionedFile]] into 0 or more [[InternalRow]] instances
+   */
+  def readFile(
+conf: Configuration,
+file: PartitionedFile,
+parser: JacksonParser): Iterator[InternalRow]
+
+  /**
+   * Create an [[RDD]] that handles the preliminary parsing of [[T]] records
+   */
+  protected def createBaseRdd(
+sparkSession: SparkSession,
+inputPaths: Seq[FileStatus]): RDD[T]
+
+  /**
+   * A generic wrapper to invoke the correct [[JsonFactory]] method to allocate a [[JsonParser]]
+   * for an instance of [[T]]
+   */
+  def createParser(jsonFactory: JsonFactory, value: T): JsonParser
+
+  final def infer(
+  sparkSession: SparkSession,
+  inputPaths: Seq[FileStatus],
+  parsedOptions: JSONOptions): Option[StructType] = {
+val jsonSchema = InferSchema.infer(
+  createBaseRdd(sparkSession, inputPaths),
+  parsedOptions,
+  createParser)
+checkConstraints(jsonSchema)
+
+if (jsonSchema.fields.nonEmpty) {
--- End diff --

It seems this changes existing behaviour (not allowing empty schema).





[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15211
  
**[Test build #70539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70539/testReport)** for PR 15211 at commit [`21ecbf0`](https://github.com/apache/spark/commit/21ecbf08ed03f9b69f4cbec7380e547f146acec7).





[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame

2016-12-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16368
  
ah, it was merged https://git-wip-us.apache.org/repos/asf?p=spark.git






[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame

2016-12-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16368
  
I kept getting errors with the merge script - not sure if it went through. We are likely having some sync issue with GitHub?






[GitHub] spark issue #16312: [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into mult...

2016-12-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16312
  
Ah, thank you @shivaram. Sorry I couldn't get around to investigating this earlier.

@yanboliang It looks like that is the design of the trait BaseReadWrite ([here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L80)), which holds references to `sc`, `sqlContext` and the `spark` session. That said, I see other MLReader/MLWriter implementations call `sc` directly, whereas the design should arguably allow the `sc`/Spark session to be updated. Specifically, we could change these calls to pass the Spark session to the RWrapper, but generally reusing `sc` is the design of BaseReadWrite/MLReader/MLWriter and is not specific to R.






[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16386
  
> the corrupt column will contain the filename instead of the literal JSON 
if there is a parsing failure

I am worried about changing the behaviour. I understand why it had to be done this way, as you described in the description, but we have the `input_file_name` function for this. At the least, I would not expect file names in `_corrupt_record`.

If this is acceptable, we need to document it around `spark.sql.columnNameOfCorruptRecord` in `SQLConf` and around `columnNameOfCorruptRecord` in the readers/writers in Python and Scala.
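
For reference, a minimal sketch of the `input_file_name` alternative mentioned above (the input path is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object SourceFilePerRecord {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-source-file").getOrCreate()
    // Attach the originating file to each row explicitly, instead of relying
    // on the parser to put file names into _corrupt_record.
    val df = spark.read
      .json("/path/to/json")
      .withColumn("source_file", input_file_name())
    df.show(truncate = false)
  }
}
```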






[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame

2016-12-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16368
  
Hmm, looks like this is merged but not reflected on GitHub?





[GitHub] spark pull request #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16387#discussion_r93732158
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
@@ -192,12 +193,16 @@ class ExternalAppendOnlyMap[K, V, C](
* It will be called by TaskMemoryManager when there is not enough memory for the task.
*/
   override protected[this] def forceSpill(): Boolean = {
-assert(readingIterator != null)
-val isSpilled = readingIterator.spill()
-if (isSpilled) {
-  currentMap = null
+if (isReadingIterator) {
+  assert(readingIterator != null)
+  val isSpilled = readingIterator.spill()
+  if (isSpilled) {
+currentMap = null
+  }
+  isSpilled
+} else {
+  false
--- End diff --

I chose to simply return false for now. Another option is to actually spill the in-memory map.





[GitHub] spark issue #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn't fail...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16387
  
**[Test build #70538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70538/testReport)** for PR 16387 at commit [`03d4dc0`](https://github.com/apache/spark/commit/03d4dc0afbba0217a322a20a999894640f43aecc).





[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-12-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r93732043
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -56,33 +58,100 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-val arrayClass = classOf[GenericArrayData].getName
-val values = ctx.freshName("values")
-ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-ev.copy(code = s"""
-  this.$values = new Object[${children.size}];""" +
-  ctx.splitExpressions(
-ctx.INPUT_ROW,
-children.zipWithIndex.map { case (e, i) =>
-  val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
-  $values[$i] = null;
-} else {
-  $values[$i] = ${eval.value};
-}
-   """
-}) +
-  s"""
-final ArrayData ${ev.value} = new $arrayClass($values);
-this.$values = null;
-  """, isNull = "false")
+val array = ctx.freshName("array")
+
+val et = dataType.elementType
+val evals = children.map(e => e.genCode(ctx))
+val isPrimitiveArray = ctx.isPrimitiveType(et)
+val primitiveTypeName = if (isPrimitiveArray) ctx.primitiveTypeName(et) else ""
+val (preprocess, arrayData, arrayWriter) =
+  GenArrayData.getCodeArrayData(ctx, et, children.size, isPrimitiveArray, array)
+
+val assigns = if (isPrimitiveArray) {
+  evals.zipWithIndex.map { case (eval, i) =>
+eval.code + s"""
+ if (${eval.isNull}) {
+   $arrayWriter.setNull$primitiveTypeName($i);
+ } else {
+   $arrayWriter.write($i, ${eval.value});
+ }
+   """
+  }
+} else {
+  evals.zipWithIndex.map { case (eval, i) =>
+eval.code + s"""
+ if (${eval.isNull}) {
+   $array[$i] = null;
+ } else {
+   $array[$i] = ${eval.value};
+ }
+   """
+  }
+}
+ev.copy(code =
+  preprocess +
+  ctx.splitExpressions(ctx.INPUT_ROW, assigns) +
+  s"\nfinal ArrayData ${ev.value} = $arrayData;\n",
+  isNull = "false")
   }
 
   override def prettyName: String = "array"
 }
 
+private [sql] object GenArrayData {
+  // This function returns Java code pieces based on DataType and isPrimitive
+  // for allocation of ArrayData class
+  def getCodeArrayData(
+  ctx: CodegenContext,
+  dt: DataType,
+  size: Int,
+  isPrimitive : Boolean,
+  array: String): (String, String, String) = {
+if (!isPrimitive) {
+  val arrayClass = classOf[GenericArrayData].getName
+  ctx.addMutableState("Object[]", array,
+s"this.$array = new Object[${size}];")
+  ("", s"new $arrayClass($array)", null)
+} else {
+  val row = ctx.freshName("row")
+  val holder = ctx.freshName("holder")
+  val rowWriter = ctx.freshName("createRowWriter")
+  val arrayWriter = ctx.freshName("createArrayWriter")
+  val unsafeRowClass = classOf[UnsafeRow].getName
+  val unsafeArrayClass = classOf[UnsafeArrayData].getName
+  val holderClass = classOf[BufferHolder].getName
+  val rowWriterClass = classOf[UnsafeRowWriter].getName
+  val arrayWriterClass = classOf[UnsafeArrayWriter].getName
+  ctx.addMutableState(unsafeRowClass, row, "")
+  ctx.addMutableState(unsafeArrayClass, array, "")
+  ctx.addMutableState(holderClass, holder, "")
+  ctx.addMutableState(rowWriterClass, rowWriter, "")
+  ctx.addMutableState(arrayWriterClass, arrayWriter, "")
+  val unsafeArraySizeInBytes =
+UnsafeArrayData.calculateHeaderPortionInBytes(size) +
+ByteArrayMethods.roundNumberOfBytesToNearestWord(dt.defaultSize * size)
+
+  // To write data to UnsafeArrayData, we create UnsafeRow with a single array field
+  // and then prepare BufferHolder for the array.
+  // In summary, this does not use UnsafeRow and wastes some bits in a byte array
+  (s"""
+$row = new $unsafeRowClass(1);
+$holder = new $holderClass($row, $unsafeArraySizeInBytes);
+$rowWriter = new $rowWriterClass($holder, 1);
--- End diff --

For now, let me make `ArrayData` mutable. If we have an agreement to make 
`UnsafeArrayData` mutable, I can do it.



[GitHub] spark pull request #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn...

2016-12-22 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/16387

[SPARK-18986][Core] ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

## What changes were proposed in this pull request?

`ExternalAppendOnlyMap.forceSpill` currently uses an assert to check that the map's iterator is not null. However, the assertion only holds after the map has been asked for its iterator. Before that, if another memory consumer asks for more memory than is currently available, `ExternalAppendOnlyMap.forceSpill` can also be called. In that case, we see a failure like this:

[info]   java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:156)
[info]   at org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
[info]   at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
[info]   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-externalappendonlymap

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16387.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16387


commit 2e4f34e54e92bfc47d817cb6392d89d660401b57
Author: Liang-Chi Hsieh 
Date:   2016-12-23T04:59:01Z

Return false when forceSpill is called before the map is asked for iterator.

commit 03d4dc0afbba0217a322a20a999894640f43aecc
Author: Liang-Chi Hsieh 
Date:   2016-12-23T05:45:54Z

Add test.







[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame

2016-12-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16368
  
Merging this into master, branch-2.0





[GitHub] spark issue #16312: [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into mult...

2016-12-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16312
  
I looked at this more closely and I think I found the problem - not sure it's easy to fix though.
What I traced here is:
- When we call sparkR.session.stop and sparkR.session, the same JVM backend is reused and only the SparkContext is stopped / recreated
- Now the problem happens when we call `read.ml` to read a model after creating a new SparkSession. This in turn calls into RWrappers[1], which has an `sc` member variable
- My understanding is that the `sc` member variable is bound the first time we create a SparkSession, so after we stop and restart it holds a handle to the stale SparkContext
- Thus we see errors saying `Cannot call methods on a stopped SparkContext`

I think the right fix here is to pass along a SparkContext into RWrappers rather than relying on a prior initialization. However, I'm not sure why that design decision was made before, so maybe I'm missing something.

[1] 
https://github.com/apache/spark/blob/f252cb5d161e064d39cc1ed1d9299307a0636174/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L36
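
A sketch of the direction suggested above, under the assumption that looking up the active context at call time is sufficient; `ModelReader` is a hypothetical stand-in for `RWrappers`:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Resolve the active SparkContext on every call instead of caching it at
// first initialization, so a SparkR session stop/restart never leaves a
// stale handle behind.
object ModelReader {
  private def activeSc: SparkContext =
    SparkSession.builder().getOrCreate().sparkContext

  def load(path: String): Unit = {
    val sc = activeSc // fresh handle, never the member bound at first init
    println(s"loading $path on application ${sc.applicationId}")
  }
}
```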






[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/15996
  
ah 
https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98 
failed. 





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13909
  
**[Test build #70537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70537/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r93731259
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -155,21 +155,24 @@ def load(self, path=None, format=None, schema=None, **options):
 return self._df(self._jreader.load())
 
 @since(1.4)
-def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
+def json(self, path, schema=None, wholeFile=None, primitivesAsString=None, prefersDecimal=None,
--- End diff --

we need to add this to the end; otherwise it breaks compatibility for 
positional arguments.
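
A general illustration of the point, in Scala rather than the pyspark signature under review (names are illustrative):

```scala
object ReaderCompat {
  // Existing callers may already write read("data.json", "mySchema")
  // positionally. Appending the new flag with a default keeps every such
  // call meaning what it meant before.
  def read(path: String, schema: String = null, wholeFile: Boolean = false): Unit =
    println(s"path=$path schema=$schema wholeFile=$wholeFile")

  // Had wholeFile been inserted between path and schema instead, the old
  // positional call read("data.json", "mySchema") would bind "mySchema" to
  // wholeFile: silently wrong in Python, a type error in Scala.
}
```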






[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93731229
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,525 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+import scala.collection.mutable
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[ml] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter
+  with HasFitIntercept with HasTol with HasStandardization with HasWeightCol with HasThreshold
+  with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0")(
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E-4.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+
+  @Since("2.2.0")
+  override def copy(extra: ParamMap): LinearSVC = defaultCopy(extra)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  setDefault(maxIter -> 100,
+regParam -> 0.0,
+threshold -> 0,
+tol -> 1E-6,
+fitIntercept -> true
+  )
+
+  /**
+   * Train a linear SVM Classifier Model with Hinge Loss and OWLQN optimizer
+   *
+   * @param dataset Training dataset
+   * @return Fitted model
+   */
+  override protected def train(dataset: Dataset[_]): LinearSVCModel = {
+val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+val instances: RDD[Instance] =
+  dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map 

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15212
  
**[Test build #70536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70536/testReport)** for PR 15212 at commit [`5a7cc2c`](https://github.com/apache/spark/commit/5a7cc2ca9e81ade4d430411ab6e314ae5010169f).





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/15996
  
LGTM. Can you update the comment to address my last comment 
(https://github.com/apache/spark/pull/15996#discussion_r93730700)?





[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93730700
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -643,6 +644,14 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
 withTable("t") {
   val provider = "org.apache.spark.sql.test.DefaultSource"
   sql(s"CREATE TABLE t USING $provider")
+
+  // make sure the data source doesn't provide `InsertableRelation`, so that we can only append
+  // data to it with `CreatableRelationProvider.createRelation`
--- End diff --

One last comment. Let's explicitly say that we want to test the case where a data source is a CreatableRelationProvider but its relation does not implement InsertableRelation.
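
A hedged sketch of such a test source, assuming a bare `BaseRelation` (with no `InsertableRelation`) is enough to force appends through `createRelation`:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

// The provider can create relations, but the relation it returns does not
// implement InsertableRelation, so an append can only be served by
// createRelation itself.
class NonInsertableSource extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // (a real test source would record `data` and `mode` here)
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = data.schema
    }
  }
}
```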





[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16337
  
**[Test build #70535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70535/testReport)** for PR 16337 at commit [`1c1900a`](https://github.com/apache/spark/commit/1c1900a261b12e95a8a53892017294df3c21b317).





[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16337
  
Retest this please.





[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2016-12-22 Thread mariusvniekerk
Github user mariusvniekerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r93730314
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -164,6 +164,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: String): String = {
--- End diff --

The R tests do indeed verify that they can call the internal functions.

I can revert that part of the changes.





[GitHub] spark pull request #16384: [BUILD] make-distribution support alternate pytho...

2016-12-22 Thread felixcheung
Github user felixcheung closed the pull request at:

https://github.com/apache/spark/pull/16384





[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2016-12-22 Thread mariusvniekerk
Github user mariusvniekerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r93729928
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -164,6 +164,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: String): String = {
--- End diff --

Yeah, when I wrote this that didn't exist yet. Changing.





[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15666
  
**[Test build #70534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70534/testReport)** for PR 15666 at commit [`73df5a4`](https://github.com/apache/spark/commit/73df5a4f5a961e558588b3462e7744a1c9c1266a).





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16386
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70531/





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #70531 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70531/testReport)**
 for PR 16386 at commit 
[`7ad5d5b`](https://github.com/apache/spark/commit/7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15996
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70532/
Test FAILed.


[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15996
  
Merged build finished. Test FAILed.


[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15996
  
**[Test build #70532 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70532/testReport)**
 for PR 15996 at commit 
[`9a1ad71`](https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery

2016-12-22 Thread kevinyu98
Github user kevinyu98 commented on the issue:

https://github.com/apache/spark/pull/16337
  
I just ran build/sbt "test-only org.apache.spark.sql.streaming.StreamSuite" on my local machine, and also the whole SQL suite; it works fine. Can you re-run the test? Thanks.


[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16323#discussion_r93726972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -41,13 +41,13 @@ import org.apache.spark.sql.types._
  * @param sizeInBytes Physical size in bytes. For leaf operators this 
defaults to 1, otherwise it
  *defaults to the product of children's `sizeInBytes`.
  * @param rowCount Estimated number of rows.
- * @param colStats Column-level statistics.
+ * @param attributeStats Statistics for Attributes.
  * @param isBroadcastable If true, output is small enough to be used in a 
broadcast join.
  */
 case class Statistics(
 sizeInBytes: BigInt,
 rowCount: Option[BigInt] = None,
-colStats: Map[String, ColumnStat] = Map.empty,
+attributeStats: AttributeMap[ColumnStat] = AttributeMap(Nil),
--- End diff --

Will we estimate statistics for all attributes in the logical plan?

I meant: if an attribute does not come from a leaf node but from a later plan node like `Join`, do we still have a `ColumnStat` for it?

If not, I think we don't need to rename this parameter to `attributeStats` instead of keeping the original `colStats`.
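
(To make the distinction concrete, a minimal sketch assuming Spark's catalyst internals on the classpath; the stats strings are just placeholders.)

```scala
// Two attributes that share the name "id" but have different exprIds stay
// distinct in an AttributeMap, while a Map keyed by column name collides.
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
import org.apache.spark.sql.types.IntegerType

val left  = AttributeReference("id", IntegerType)()   // fresh exprId
val right = AttributeReference("id", IntegerType)()   // same name, new exprId

val byAttr = AttributeMap(Seq(left -> "left stats", right -> "right stats"))
val byName = Map(left.name -> "left stats", right.name -> "right stats")

assert(byAttr.size == 2)  // keyed by exprId: both entries survive
assert(byName.size == 1)  // keyed by name: one entry is silently lost
```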


[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16228
  
**[Test build #70533 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70533/testReport)**
 for PR 16228 at commit 
[`c3e3a48`](https://github.com/apache/spark/commit/c3e3a48c930c9f00bf77a11dfe0ef819ca005b26).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LeftSemiAntiEstimation(join: Join) `


[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16228
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70533/
Test FAILed.


[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16228
  
Merged build finished. Test FAILed.


[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16228
  
**[Test build #70533 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70533/testReport)**
 for PR 16228 at commit 
[`c3e3a48`](https://github.com/apache/spark/commit/c3e3a48c930c9f00bf77a11dfe0ef819ca005b26).


[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16323#discussion_r93726768
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -237,6 +239,38 @@ case class CatalogTable(
 }
 
 
+/**
+ * This class of statistics is used in [[CatalogTable]] to interact with 
metastore.
--- End diff --

Can you add a few words explaining why we don't use `Statistics` for `CatalogTable`?


[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r93726522
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -56,33 +58,100 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-val arrayClass = classOf[GenericArrayData].getName
-val values = ctx.freshName("values")
-ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-ev.copy(code = s"""
-  this.$values = new Object[${children.size}];""" +
-  ctx.splitExpressions(
-ctx.INPUT_ROW,
-children.zipWithIndex.map { case (e, i) =>
-  val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
-  $values[$i] = null;
-} else {
-  $values[$i] = ${eval.value};
-}
-   """
-}) +
-  s"""
-final ArrayData ${ev.value} = new $arrayClass($values);
-this.$values = null;
-  """, isNull = "false")
+val array = ctx.freshName("array")
+
+val et = dataType.elementType
+val evals = children.map(e => e.genCode(ctx))
+val isPrimitiveArray = ctx.isPrimitiveType(et)
+val primitiveTypeName = if (isPrimitiveArray) 
ctx.primitiveTypeName(et) else ""
+val (preprocess, arrayData, arrayWriter) =
+  GenArrayData.getCodeArrayData(ctx, et, children.size, 
isPrimitiveArray, array)
+
+val assigns = if (isPrimitiveArray) {
+  evals.zipWithIndex.map { case (eval, i) =>
+eval.code + s"""
+ if (${eval.isNull}) {
+   $arrayWriter.setNull$primitiveTypeName($i);
+ } else {
+   $arrayWriter.write($i, ${eval.value});
+ }
+   """
+  }
+} else {
+  evals.zipWithIndex.map { case (eval, i) =>
+eval.code + s"""
+ if (${eval.isNull}) {
+   $array[$i] = null;
+ } else {
+   $array[$i] = ${eval.value};
+ }
+   """
+  }
+}
+ev.copy(code =
+  preprocess +
+  ctx.splitExpressions(ctx.INPUT_ROW, assigns) +
+  s"\nfinal ArrayData ${ev.value} = $arrayData;\n",
+  isNull = "false")
   }
 
   override def prettyName: String = "array"
 }
 
+private [sql] object GenArrayData {
+  // This function returns Java code pieces based on DataType and 
isPrimitive
+  // for allocation of ArrayData class
+  def getCodeArrayData(
+  ctx: CodegenContext,
+  dt: DataType,
+  size: Int,
+  isPrimitive : Boolean,
+  array: String): (String, String, String) = {
+if (!isPrimitive) {
+  val arrayClass = classOf[GenericArrayData].getName
+  ctx.addMutableState("Object[]", array,
+s"this.$array = new Object[${size}];")
+  ("", s"new $arrayClass($array)", null)
+} else {
+  val row = ctx.freshName("row")
+  val holder = ctx.freshName("holder")
+  val rowWriter = ctx.freshName("createRowWriter")
+  val arrayWriter = ctx.freshName("createArrayWriter")
+  val unsafeRowClass = classOf[UnsafeRow].getName
+  val unsafeArrayClass = classOf[UnsafeArrayData].getName
+  val holderClass = classOf[BufferHolder].getName
+  val rowWriterClass = classOf[UnsafeRowWriter].getName
+  val arrayWriterClass = classOf[UnsafeArrayWriter].getName
+  ctx.addMutableState(unsafeRowClass, row, "")
+  ctx.addMutableState(unsafeArrayClass, array, "")
+  ctx.addMutableState(holderClass, holder, "")
+  ctx.addMutableState(rowWriterClass, rowWriter, "")
+  ctx.addMutableState(arrayWriterClass, arrayWriter, "")
+  val unsafeArraySizeInBytes =
+UnsafeArrayData.calculateHeaderPortionInBytes(size) +
+ByteArrayMethods.roundNumberOfBytesToNearestWord(dt.defaultSize * 
size)
+
+  // To write data to UnsafeArrayData, we create UnsafeRow with a 
single array field
+  // and then prepare BufferHolder for the array.
+  // In summary, this does not use UnsafeRow and wastes some bits in 
an byte array
+  (s"""
+$row = new $unsafeRowClass(1);
+$holder = new $holderClass($row, $unsafeArraySizeInBytes);
+$rowWriter = new $rowWriterClass($holder, 1);
--- End diff --

About the scope of this change: do we need to make `ArrayData` mutable, or just `UnsafeArrayData`? Actually, I don't see that `GenericArrayData` needs this mutability now.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726073
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,11 +171,14 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+ * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to 
a chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features 
instead of a fixed number.
  *  - `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false
  *positive rate of selection.
+ *  - `fdr` chooses all features whose false discovery rate meets some 
threshold.
--- End diff --

Ditto


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726194
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -255,19 +288,22 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
 
 private[spark] object ChiSqSelector {
 
-  /**
-   * String name for `numTopFeatures` selector type.
-   */
+  /** String name for `numTopFeatures` selector type. */
   val NumTopFeatures: String = "numTopFeatures"
--- End diff --

```private[spark]```


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93725579
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
   def getFpr: Double = $(fpr)
 
   /**
+   * The highest uncorrected p-value for features to be kept.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("2.1.0")
--- End diff --

Update version to 2.2.0.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726048
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -111,11 +139,14 @@ private[feature] trait ChiSqSelectorParams extends 
Params
 /**
  * Chi-Squared feature selection, which selects categorical features to 
use for predicting a
  * categorical label.
- * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+ * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to 
a chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features 
instead of a fixed number.
  *  - `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false
  *positive rate of selection.
+ *  - `fdr` chooses all features whose false discovery rate meets some 
threshold.
+ *  - `fwe` chooses all features whose family-wise error rate meets some 
threshold.
--- End diff --

Update according to the above suggestion.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93725408
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -227,11 +227,13 @@ both speed and statistical learning behavior.
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
 Chi-Squared feature selection. It operates on labeled data with 
categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of 
all features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
+* `fdr` chooses all features whose false discovery rate meets some 
threshold.
+* `fwe` chooses all features whose family-wise error rate meets some 
threshold.
--- End diff --

Update according to the above suggestion.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726001
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
   def getFpr: Double = $(fpr)
 
   /**
+   * The highest uncorrected p-value for features to be kept.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+"The highest uncorrected p-value for features to be kept.", 
ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The highest uncorrected p-value for features to be kept.
--- End diff --

Ditto, ```The upper bound of the expected family-wise error rate```.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93725173
  
--- Diff: docs/ml-features.md ---
@@ -1423,12 +1423,12 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
-
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of 
all features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
-
+* `fdr` chooses all features whose false discovery rate meets some 
threshold.
+* `fwe` chooses all features whose family-wise error rate meets some 
threshold.
--- End diff --

```whose p-value is below a threshold, thus controlling the family-wise error rate of selection```
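
(For reference, a minimal self-contained sketch of `fwe`-style selection via the Bonferroni bound; the function name and input shape are illustrative, not Spark's API.)

```scala
// Keep feature i iff its p-value is below alpha / m, which bounds the
// family-wise error rate at alpha (Bonferroni correction).
def selectByFwe(pValues: Seq[(Double, Int)], alpha: Double): Seq[Int] = {
  val m = pValues.length
  pValues.collect { case (p, featureIndex) if p < alpha / m => featureIndex }
}

// Example: with alpha = 0.05 and 4 features, the cutoff is 0.0125.
val selected = selectByFwe(Seq((0.017, 0), (0.049, 1), (0.193, 2), (0.005, 3)), 0.05)
// selected == Seq(3)
```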


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726320
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
@@ -27,61 +27,240 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 
   /*
*  Contingency tables
-   *  feature0 = {8.0, 0.0}
+   *  feature0 = {6.0, 0.0, 8.0}
*  class  0 1 2
-   *8.0||1|0|1|
-   *0.0||0|2|0|
+   *6.0||1|0|0|
+   *0.0||0|3|0|
+   *8.0||0|0|2|
+   *  degree of freedom = 4, statistic = 12, pValue = 0.017
*
*  feature1 = {7.0, 9.0}
*  class  0 1 2
*7.0||1|0|0|
-   *9.0||0|2|1|
+   *9.0||0|3|2|
+   *  degree of freedom = 2, statistic = 6, pValue = 0.049
*
-   *  feature2 = {0.0, 6.0, 8.0, 5.0}
+   *  feature2 = {0.0, 6.0, 3.0, 8.0}
*  class  0 1 2
*0.0||1|0|0|
-   *6.0||0|1|0|
+   *6.0||0|1|2|
+   *3.0||0|1|0|
*8.0||0|1|0|
-   *5.0||0|0|1|
+   *  degree of freedom = 6, statistic = 8.66, pValue = 0.193
+   *
+   *  feature3 = {7.0, 0.0, 5.0, 4.0}
+   *  class  0 1 2
+   *7.0||1|0|0|
+   *0.0||0|2|0|
+   *5.0||0|1|1|
+   *4.0||0|0|1|
+   *  degree of freedom = 6, statistic = 9.5, pValue = 0.147
+   *
+   *  feature4 = {6.0, 5.0, 4.0, 0.0}
+   *  class  0 1 2
+   *6.0||1|1|0|
+   *5.0||0|2|0|
+   *4.0||0|0|1|
+   *0.0||0|0|1|
+   *  degree of freedom = 6, statistic = 8.0, pValue = 0.238
+   *
+   *  feature5 = {0.0, 9.0, 5.0, 4.0}
+   *  class  0 1 2
+   *0.0||1|0|1|
+   *9.0||0|1|0|
+   *5.0||0|1|0|
+   *4.0||0|1|1|
+   *  degree of freedom = 6, statistic = 5, pValue = 0.54
*
*  Use chi-squared calculator from Internet
*/
 
-  test("ChiSqSelector transform test (sparse & dense vector)") {
+  test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
 val labeledDiscreteData = sc.parallelize(
--- End diff --

Many test functions need ```labeledDiscreteData```; we can refactor it out of the function so the other test functions share the same dataset instance.


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726203
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -255,19 +288,22 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
 
 private[spark] object ChiSqSelector {
 
-  /**
-   * String name for `numTopFeatures` selector type.
-   */
+  /** String name for `numTopFeatures` selector type. */
   val NumTopFeatures: String = "numTopFeatures"
 
-  /**
-   * String name for `percentile` selector type.
-   */
+  /** String name for `percentile` selector type. */
   val Percentile: String = "percentile"
--- End diff --

```private[spark]```


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93726092
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -245,6 +264,20 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < fpr }
+  case ChiSqSelector.FDR =>
+// This uses the Benjamini-Hochberg procedure.
--- End diff --

Add a link explaining the ```Benjamini-Hochberg procedure```: 
https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
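
(A minimal standalone sketch of that procedure, with illustrative names; the actual Spark code operates on chi-squared test results.)

```scala
// Benjamini-Hochberg: sort p-values ascending, find the largest 1-based
// rank k with p(k) <= alpha * k / m, and keep the first k features.
def selectByFdr(pValues: Seq[(Double, Int)], alpha: Double): Seq[Int] = {
  val m = pValues.length
  val sorted = pValues.sortBy { case (p, _) => p }
  val maxRank = sorted.zipWithIndex.collect {
    case ((p, _), i) if p <= alpha * (i + 1) / m => i + 1
  }.lastOption.getOrElse(0)
  sorted.take(maxRank).map { case (_, featureIndex) => featureIndex }
}
```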


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93725098
  
--- Diff: docs/ml-features.md ---
@@ -1423,12 +1423,12 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
-
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of 
all features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
-
+* `fdr` chooses all features whose false discovery rate meets some 
threshold.
--- End diff --

``` `fdr` uses the [Benjamini-Hochberg 
procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
 to choose all features whose false discovery rate is below a threshold``` 
would be better?


[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...

2016-12-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r93725546
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
   def getFpr: Double = $(fpr)
 
   /**
+   * The highest uncorrected p-value for features to be kept.
--- End diff --

I think the doc is incorrect even though it's consistent with sklearn; actually we don't compare the ```fdr``` value with the ```p-value``` directly. I would prefer to change it to ```The upper bound of the expected false discovery rate```, which is more accurate and easier to understand.


[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-22 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16291
  

I agree with @markhamstra and @vanzin - having the ability to tag listeners into groups (default = spark listener group) and preserving the current synchronized behavior within a group would ensure backward compatibility at fairly minimal additional complexity.
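
(A hedged sketch of that idea; every name here is illustrative and none of it is Spark's actual listener-bus API.)

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.collection.mutable

trait Listener { def onEvent(event: AnyRef): Unit }

// One single-threaded executor per group keeps dispatch within a group
// ordered and effectively synchronized, while groups run independently.
class GroupedListenerBus {
  private val groups =
    mutable.Map.empty[String, (ExecutorService, mutable.ListBuffer[Listener])]

  def addListener(listener: Listener, group: String = "spark"): Unit = synchronized {
    val (_, listeners) = groups.getOrElseUpdate(
      group, (Executors.newSingleThreadExecutor(), mutable.ListBuffer.empty[Listener]))
    listeners += listener
  }

  def post(event: AnyRef): Unit = synchronized {
    groups.values.foreach { case (executor, listeners) =>
      executor.submit(new Runnable {
        override def run(): Unit = listeners.foreach(_.onEvent(event))
      })
    }
  }
}
```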



[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread NathanHowell
Github user NathanHowell commented on the issue:

https://github.com/apache/spark/pull/16386
  
Hello recent JacksonGenerator.scala committers, please take a look.

cc/ @rxin @hvanhovell @clockfly @hyukjinkwon @cloud-fan


[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15996
  
**[Test build #70532 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70532/testReport)**
 for PR 15996 at commit 
[`9a1ad71`](https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98).


[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #70531 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70531/testReport)**
 for PR 16386 at commit 
[`7ad5d5b`](https://github.com/apache/spark/commit/7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695).


[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-22 Thread NathanHowell
GitHub user NathanHowell opened a pull request:

https://github.com/apache/spark/pull/16386

[SPARK-18352][SQL] Support parsing multiline json files

## What changes were proposed in this pull request?

If a new option `wholeFile` is set to `true`, the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming, and it should be capable of parsing very large documents, assuming the row will fit in memory.
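
(A hedged usage sketch: the `wholeFile` option name comes from this description, everything else is the standard `DataFrameReader` API, and the path is hypothetical.)

```scala
// Each matched file is parsed as one JSON value rather than line-delimited JSON.
val df = spark.read
  .option("wholeFile", "true")
  .json("/data/multiline/*.json")
```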

Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column, and byte offsets) to the output if desired.

I've also included a few other changes that generate slightly better 
bytecode and (imo) make it more obvious when and where boxing is occurring in 
the parser. These are included as separate commits, let me know if they should 
be flattened into this PR or moved to a new one.

## How was this patch tested?

New and existing unit tests. No performance or load tests have been run.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NathanHowell/spark SPARK-18352

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16386.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16386


commit 740620210b30ef02e280d161d6b08088d07300fa
Author: Nathan Howell 
Date:   2016-12-22T22:16:49Z

[SPARK-18352][SQL] Support parsing multiline json files

commit 7902255a79fc2581214a09ccd38437cebd19d862
Author: Nathan Howell 
Date:   2016-12-22T00:27:19Z

JacksonParser.parseJsonToken should be explicit about nulls and boxing

commit 149418647c9831e88af866d44d31496940c02162
Author: Nathan Howell 
Date:   2016-12-21T23:49:37Z

Increase type safety of makeRootConverter, remove runtime type tests

commit 7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695
Author: Nathan Howell 
Date:   2016-12-23T02:13:59Z

Field converter lookups should be O(1)




[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16383#discussion_r93725196
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala
 ---
@@ -143,15 +197,96 @@ case class TypedAggregateExpression(
 }
   }
 
-  override def toString: String = {
-val input = inputDeserializer match {
-  case Some(UnresolvedDeserializer(deserializer, _)) => 
deserializer.dataType.simpleString
-  case Some(deserializer) => deserializer.dataType.simpleString
-  case _ => "unknown"
+  override def withInputInfo(
+  deser: Expression,
+  cls: Class[_],
+  schema: StructType): TypedAggregateExpression = {
+copy(inputDeserializer = Some(deser), inputClass = Some(cls), 
inputSchema = Some(schema))
+  }
+}
+
+case class ComplexTypedAggregateExpression(
+aggregator: Aggregator[Any, Any, Any],
+inputDeserializer: Option[Expression],
+inputClass: Option[Class[_]],
+inputSchema: Option[StructType],
+bufferSerializer: Seq[NamedExpression],
+bufferDeserializer: Expression,
+outputSerializer: Seq[Expression],
+dataType: DataType,
+nullable: Boolean,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[Any] with TypedAggregateExpression with 
NonSQLExpression {
+
+  override def deterministic: Boolean = true
+
+  override def children: Seq[Expression] = inputDeserializer.toSeq
+
+  override lazy val resolved: Boolean = inputDeserializer.isDefined && 
childrenResolved
+
+  override def references: AttributeSet = 
AttributeSet(inputDeserializer.toSeq)
+
+  override def createAggregationBuffer(): Any = aggregator.zero
+
+  private lazy val inputRowToObj = 
GenerateSafeProjection.generate(inputDeserializer.get :: Nil)
+
+  override def update(buffer: Any, input: InternalRow): Any = {
+val inputObj = inputRowToObj(input).get(0, ObjectType(classOf[Any]))
+if (inputObj != null) {
+  aggregator.reduce(buffer, inputObj)
+} else {
+  buffer
+}
+  }
+
+  override def merge(buffer: Any, input: Any): Any = {
+aggregator.merge(buffer, input)
+  }
+
+  private lazy val resultObjToRow = dataType match {
+case _: StructType =>
+  UnsafeProjection.create(CreateStruct(outputSerializer))
+case _ =>
+  assert(outputSerializer.length == 1)
+  UnsafeProjection.create(outputSerializer.head)
+  }
+
+  override def eval(buffer: Any): Any = {
+val resultObj = aggregator.finish(buffer)
+if (resultObj == null) {
+  null
+} else {
+  resultObjToRow(InternalRow(resultObj)).get(0, dataType)
 }
+  }
 
-s"$nodeName($input)"
+  private lazy val bufferObjToRow = 
UnsafeProjection.create(bufferSerializer)
+
+  override def serialize(buffer: Any): Array[Byte] = {
+bufferObjToRow(InternalRow(buffer)).getBytes
   }
 
-  override def nodeName: String = 
aggregator.getClass.getSimpleName.stripSuffix("$")
+  private lazy val bufferRow = new UnsafeRow(bufferSerializer.length)
+  private lazy val bufferRowToObject = 
GenerateSafeProjection.generate(bufferDeserializer :: Nil)
+
+  override def deserialize(storageFormat: Array[Byte]): Any = {
+bufferRow.pointTo(storageFormat, storageFormat.length)
+bufferRowToObject(bufferRow).get(0, ObjectType(classOf[Any]))
+  }
+
+  override def withNewMutableAggBufferOffset(
+  newMutableAggBufferOffset: Int): ComplexTypedAggregateExpression =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(
+  newInputAggBufferOffset: Int): ComplexTypedAggregateExpression =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def withInputInfo(
+  deser: Expression,
+  cls: Class[_],
+  schema: StructType): TypedAggregateExpression = {
+copy(inputDeserializer = Some(deser), inputClass = Some(cls), 
inputSchema = Some(schema))
--- End diff --

Where do we need to use `inputClass`? `TypedAggregateExpression` has this parameter but I don't see it used anywhere.


[GitHub] spark issue #16119: [SPARK-18687][Pyspark][SQL]Backward compatibility - crea...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16119
  
@vijoshi do you mind updating your PR according to the discussion? i.e. simplify the fix and the test


[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...

2016-12-22 Thread lirui-intel
Github user lirui-intel commented on the issue:

https://github.com/apache/spark/pull/12775
  
I'm not sure if my patch makes the tests unstable, but I can't figure out why. @kayousterhout @mridulm any ideas?


[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16383#discussion_r93724428
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
 ---
@@ -505,19 +511,18 @@ abstract class TypedImperativeAggregate[T] extends 
ImperativeAggregate {
   def deserialize(storageFormat: Array[Byte]): T
 
   final override def initialize(buffer: InternalRow): Unit = {
-val bufferObject = createAggregationBuffer()
-buffer.update(mutableAggBufferOffset, bufferObject)
+buffer(mutableAggBufferOffset) = createAggregationBuffer()
   }
 
   final override def update(buffer: InternalRow, input: InternalRow): Unit 
= {
-update(getBufferObject(buffer), input)
+buffer(mutableAggBufferOffset) = update(getBufferObject(buffer), input)
--- End diff --

I don't find that `InternalRow` implements `apply(int)`; is there an implicit conversion here?
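
(For what it's worth, a standalone sketch of the Scala desugaring in play, unrelated to Spark's classes: `row(i) = v` rewrites to `row.update(i, v)`, so no `apply(int)` is required for assignment syntax.)

```scala
// A class with only `update` still supports assignment syntax.
class Buffer(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v  // enables `buf(i) = v`
  def get(i: Int): Any = values(i)
}

val buf = new Buffer(2)
buf(0) = "zero"          // desugars to buf.update(0, "zero")
assert(buf.get(0) == "zero")
```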


[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...

2016-12-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16383#discussion_r93724370
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
 ---
@@ -471,23 +471,29 @@ abstract class TypedImperativeAggregate[T] extends 
ImperativeAggregate {
   def createAggregationBuffer(): T
 
   /**
-   * In-place updates the aggregation buffer object with an input row. 
buffer = buffer + input.
+   * Updates the aggregation buffer object with an input row and returns a 
new buffer object. For
+   * performance, the function may do in-place update and return it 
instead of constructing new
--- End diff --

Oh, got it. That makes sense.
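
(To illustrate why returning the buffer matters, a minimal sketch with illustrative names: when the buffer is an immutable value, `update` must hand back the new object.)

```scala
// An aggregate whose buffer is an immutable Option; in-place mutation is
// impossible, so each update returns a (possibly new) buffer object.
object MaxAgg {
  type Buffer = Option[Int]
  def createAggregationBuffer(): Buffer = None
  def update(buffer: Buffer, input: Int): Buffer = buffer match {
    case Some(max) => Some(math.max(max, input))
    case None      => Some(input)
  }
}

val result = Seq(3, 9, 4).foldLeft(MaxAgg.createAggregationBuffer())(MaxAgg.update)
assert(result == Some(9))
```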


[GitHub] spark issue #14627: [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check fi...

2016-12-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14627
  
@rxin, it does not fix any bug but just gets rid of duplicated logic. I will try to open a separate JIRA in such cases in the future to prevent confusion. Thank you.


[GitHub] spark issue #16371: [SPARK-18932][SQL] Support partial aggregation for colle...

2016-12-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16371
  
@hvanhovell Got it. Thanks for the review.


[GitHub] spark issue #16361: [SPARK-18952] Regex strings not properly escaped in code...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16361
  
It seems that the grouping key alias is only used for execution (the logical Aggregate node doesn't need the grouping expressions to be named). Can we just alias them with k1, k2, ... to avoid this problem?
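
(A hedged sketch of the suggestion, assuming catalyst's `Alias` on the classpath; the helper name is hypothetical.)

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

// Give each unnamed grouping expression a positional alias k1, k2, ...
// instead of deriving a name from its (possibly regex-laden) SQL text.
def aliasGroupingKeys(keys: Seq[Expression]): Seq[NamedExpression] =
  keys.zipWithIndex.map {
    case (named: NamedExpression, _) => named            // already named
    case (expr, i)                   => Alias(expr, s"k${i + 1}")()
  }
```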


[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16294
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...

2016-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16294
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70530/
Test PASSed.


[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16294
  
**[Test build #70530 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70530/testReport)**
 for PR 16294 at commit 
[`576b432`](https://github.com/apache/spark/commit/576b432f4eb90dae4f9c3573a5b6bd665ab1d8a9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93723071
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala
 ---
@@ -195,12 +195,25 @@ class PartitionProviderCompatibilitySuite
   withTempDir { dir =>
 setupPartitionedDatasourceTable("test", dir)
 if (enabled) {
-  spark.sql("msck repair table test")
+  assert(spark.table("test").count() == 0)
+} else {
+  assert(spark.table("test").count() == 5)
 }
-assert(spark.sql("select * from test").count() == 5)
-spark.range(10).selectExpr("id as fieldOne", "id as partCol")
+
+spark.range(3, 13).selectExpr("id as fieldOne", "id as 
partCol")
   
.write.partitionBy("partCol").mode("append").saveAsTable("test")
-assert(spark.sql("select * from test").count() == 15)
+
+if (enabled) {
+  // Only the newly written partitions are visible, which 
means the partitions
--- End diff --

To be consistent with the behavior of `InsertIntoTable`. I'll add that.


[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93723027
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -635,4 +638,13 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
   checkAnswer(spark.table("t"), Row(1, "a") :: Row(2, "b") :: Nil)
 }
   }
+
+  test("use saveAsTable to append to a data source table implementing 
CreatableRelationProvider") {
+withTable("t") {
+  val provider = "org.apache.spark.sql.test.DefaultSource"
--- End diff --

The data source is defined in this file: 
https://github.com/apache/spark/pull/15996/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R48

I think it's easy to tell that it extends `CreatableRelationProvider` but does not return an `InsertableRelation`.
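
(For readers skimming the thread, a minimal sketch of such a source; the trait names are Spark's public `sources` API, while the body is illustrative.)

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // ... persist `data` somewhere ...
    val ctx = sqlContext
    // The returned relation does NOT mix in InsertableRelation, so appends
    // via saveAsTable must route back through createRelation.
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}
```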


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16371: [SPARK-18932][SQL] Support partial aggregation for colle...

2016-12-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16371
  
sounds good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame

2016-12-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16368
  
LGTM. Thanks @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...

2016-12-22 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16294
  
LGTM pending tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93722426
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala
 ---
@@ -195,12 +195,25 @@ class PartitionProviderCompatibilitySuite
   withTempDir { dir =>
 setupPartitionedDatasourceTable("test", dir)
 if (enabled) {
-  spark.sql("msck repair table test")
--- End diff --

yep


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93722334
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -140,153 +140,55 @@ case class CreateDataSourceTableAsSelectCommand(
 val tableIdentWithDB = table.identifier.copy(database = Some(db))
 val tableName = tableIdentWithDB.unquotedString
 
-var createMetastoreTable = false
-// We may need to reorder the columns of the query to match the 
existing table.
-var reorderedColumns = Option.empty[Seq[NamedExpression]]
 if (sessionState.catalog.tableExists(tableIdentWithDB)) {
-  // Check if we need to throw an exception or just return.
-  mode match {
-case SaveMode.ErrorIfExists =>
-  throw new AnalysisException(s"Table $tableName already exists. " 
+
-s"If you are using saveAsTable, you can set SaveMode to 
SaveMode.Append to " +
-s"insert data into the table or set SaveMode to 
SaveMode.Overwrite to overwrite" +
-s"the existing data. " +
-s"Or, if you are using SQL CREATE TABLE, you need to drop 
$tableName first.")
-case SaveMode.Ignore =>
-  // Since the table already exists and the save mode is Ignore, 
we will just return.
-  return Seq.empty[Row]
-case SaveMode.Append =>
-  val existingTable = 
sessionState.catalog.getTableMetadata(tableIdentWithDB)
-
-  if (existingTable.provider.get == DDLUtils.HIVE_PROVIDER) {
-throw new AnalysisException(s"Saving data in the Hive serde 
table $tableName is " +
-  "not supported yet. Please use the insertInto() API as an 
alternative.")
-  }
-
-  // Check if the specified data source match the data source of 
the existing table.
-  val existingProvider = 
DataSource.lookupDataSource(existingTable.provider.get)
-  val specifiedProvider = 
DataSource.lookupDataSource(table.provider.get)
-  // TODO: Check that options from the resolved relation match the 
relation that we are
-  // inserting into (i.e. using the same compression).
-  if (existingProvider != specifiedProvider) {
-throw new AnalysisException(s"The format of the existing table 
$tableName is " +
-  s"`${existingProvider.getSimpleName}`. It doesn't match the 
specified format " +
-  s"`${specifiedProvider.getSimpleName}`.")
-  }
-
-  if (query.schema.length != existingTable.schema.length) {
-throw new AnalysisException(
-  s"The column number of the existing table $tableName" +
-s"(${existingTable.schema.catalogString}) doesn't match 
the data schema" +
-s"(${query.schema.catalogString})")
-  }
-
-  val resolver = sessionState.conf.resolver
-  val tableCols = existingTable.schema.map(_.name)
-
-  reorderedColumns = Some(existingTable.schema.map { f =>
-query.resolve(Seq(f.name), resolver).getOrElse {
-  val inputColumns = query.schema.map(_.name).mkString(", ")
-  throw new AnalysisException(
-s"cannot resolve '${f.name}' given input columns: 
[$inputColumns]")
-}
-  })
-
-  // In `AnalyzeCreateTable`, we verified the consistency between 
the user-specified table
-  // definition(partition columns, bucketing) and the SELECT 
query, here we also need to
-  // verify the the consistency between the user-specified table 
definition and the existing
-  // table definition.
-
-  // Check if the specified partition columns match the existing 
table.
-  val specifiedPartCols = CatalogUtils.normalizePartCols(
-tableName, tableCols, table.partitionColumnNames, resolver)
-  if (specifiedPartCols != existingTable.partitionColumnNames) {
-throw new AnalysisException(
-  s"""
-|Specified partitioning does not match that of the 
existing table $tableName.
-|Specified partition columns: 
[${specifiedPartCols.mkString(", ")}]
-|Existing partition columns: 
[${existingTable.partitionColumnNames.mkString(", ")}]
-  """.stripMargin)
-  }
-
-  // Check if the specified bucketing match the existing table.
-  val specifiedBucketSpec = table.bucketSpec.map { bucketSpec =>
-CatalogUtils.normalizeBucketSpec(tableName, tableCols, 
bucketSpec, resolver)
-  }
-  if (specifiedBucketSpec != 

[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...

2016-12-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15996#discussion_r93722277
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -363,48 +365,125 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
   throw new AnalysisException("Cannot create hive serde table with 
saveAsTable API")
 }
 
-val tableExists = 
df.sparkSession.sessionState.catalog.tableExists(tableIdent)
-
-(tableExists, mode) match {
-  case (true, SaveMode.Ignore) =>
-// Do nothing
-
-  case (true, SaveMode.ErrorIfExists) =>
-throw new AnalysisException(s"Table $tableIdent already exists.")
-
-  case _ =>
-val existingTable = if (tableExists) {
-  
Some(df.sparkSession.sessionState.catalog.getTableMetadata(tableIdent))
-} else {
-  None
-}
-val storage = if (tableExists) {
-  existingTable.get.storage
-} else {
-  DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
-}
-val tableType = if (tableExists) {
-  existingTable.get.tableType
-} else if (storage.locationUri.isDefined) {
-  CatalogTableType.EXTERNAL
-} else {
-  CatalogTableType.MANAGED
+val catalog = df.sparkSession.sessionState.catalog
+val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase)
+val tableIdentWithDB = tableIdent.copy(database = Some(db))
+val tableName = tableIdent.unquotedString
+
+catalog.getTableMetadataOption(tableIdent) match {
+  // If the table already exists...
+  case Some(tableMeta) =>
+mode match {
+  case SaveMode.Ignore => // Do nothing
+
+  case SaveMode.ErrorIfExists =>
+throw new AnalysisException(s"Table $tableName already exists. 
You can set SaveMode " +
+  "to SaveMode.Append to insert data into the table or set 
SaveMode to " +
+  "SaveMode.Overwrite to overwrite the existing data.")
+
+  case SaveMode.Append =>
+// Check if the specified data source match the data source of 
the existing table.
+val specifiedProvider = DataSource.lookupDataSource(source)
+// TODO: Check that options from the resolved relation match 
the relation that we are
+// inserting into (i.e. using the same compression).
+
+// Pass a table identifier with database part, so that 
`lookupRelation` won't get temp
+// views unexpectedly.
+
EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
+  case l @ LogicalRelation(_: InsertableRelation | _: 
HadoopFsRelation, _, _) =>
+// check if the file formats match
+l.relation match {
+  case r: HadoopFsRelation if r.fileFormat.getClass != 
specifiedProvider =>
+throw new AnalysisException(
+  s"The file format of the existing table $tableName 
is " +
+s"`${r.fileFormat.getClass.getName}`. It doesn't 
match the specified " +
+s"format `$source`")
+  case _ =>
+}
+  case s: SimpleCatalogRelation if 
DDLUtils.isDatasourceTable(s.metadata) => // OK.
+  case c: CatalogRelation if c.catalogTable.provider == 
Some(DDLUtils.HIVE_PROVIDER) =>
+throw new AnalysisException(s"Saving data in the Hive 
serde table $tableName " +
+  s"is not supported yet. Please use the insertInto() API 
as an alternative.")
+  case o =>
+throw new AnalysisException(s"Saving data in ${o.toString} 
is not supported.")
+}
+
+val existingSchema = tableMeta.schema
+if (df.logicalPlan.schema.size != existingSchema.size) {
+  throw new AnalysisException(
+s"The column number of the existing table $tableName" +
+  s"(${existingSchema.catalogString}) doesn't match the 
data schema" +
+  s"(${df.logicalPlan.schema.catalogString})")
+}
+
+if (partitioningColumns.isDefined) {
+  logWarning("append to an existing table, the specified 
partition columns " +
+s"[${partitioningColumns.get.mkString(", ")}] will be 
ignored.")
+}
+
+val specifiedBucketSpec = getBucketSpec
+if (specifiedBucketSpec.isDefined) {
+  logWarning("append to an existing table, the specified 
bucketing " +
+ 

[GitHub] spark issue #16370: [SPARK-18960][SQL][SS] Avoid double reading file which i...

2016-12-22 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16370
  
@zsxwing Thanks for the reminder!
In some ways we can avoid this issue, for example by not using `-cp`. But 
that is user-side behaviour, and we cannot ensure that every user knows and 
uses the correct way to move data; it may confuse users who are unaware of 
this issue. Besides, the current change is so small that it does no harm to 
the code. So, IMHO, it is OK to add the check for protection. What do you think?
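
As a rough illustration of the hazard being discussed (a hypothetical sketch, not the check added by this PR): a file copied with `cp` into a watched directory can be picked up while it is still growing, so one common user-side guard is to skip files whose size is still changing:

    import java.nio.file.{Files, Paths}

    // Hypothetical guard: treat a file as safe to read only if its size is
    // unchanged across two observations. In practice the robust fix is to
    // move files into the source directory atomically (write elsewhere,
    // then rename).
    def isStable(path: String, waitMs: Long = 500): Boolean = {
      val p = Paths.get(path)
      val sizeBefore = Files.size(p)
      Thread.sleep(waitMs)
      Files.size(p) == sizeBefore
    }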


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...

2016-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16294
  
**[Test build #70530 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70530/testReport)**
 for PR 16294 at commit 
[`576b432`](https://github.com/apache/spark/commit/576b432f4eb90dae4f9c3573a5b6bd665ab1d8a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


