[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14452 Revisited this by rebasing with master. BTW, of the 500+ LOC changed, more than 200 LOC are test cases.
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14452 **[Test build #70541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70541/testReport)** for PR 14452 at commit [`9faf90a`](https://github.com/apache/spark/commit/9faf90a346909b27aa7365bc42cd139c7d0fb3a7).
[GitHub] spark issue #16232: [SPARK-18800][SQL] Correct the assert in UnsafeKVExterna...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16232 ping @davies
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13909 **[Test build #70540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70540/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16337 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70535/
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15666 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70534/
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15666 Merged build finished. Test PASSed.
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16337 Merged build finished. Test PASSed.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15666 **[Test build #70534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70534/testReport)** for PR 15666 at commit [`73df5a4`](https://github.com/apache/spark/commit/73df5a4f5a961e558588b3462e7744a1c9c1266a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16337 **[Test build #70535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70535/testReport)** for PR 16337 at commit [`1c1900a`](https://github.com/apache/spark/commit/1c1900a261b12e95a8a53892017294df3c21b317).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13909 Jenkins, retest this please
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/15211 I've sent a new update addressing most of the comments. The only exception is about `SetWeightCol` in `LinearSVCModel`. cc @jkbradley.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13909 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70537/
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13909 Merged build finished. Test FAILed.
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15211 Merged build finished. Test PASSed.
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15211 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70539/
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15211 **[Test build #70539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70539/testReport)** for PR 15211 at commit [`21ecbf0`](https://github.com/apache/spark/commit/21ecbf08ed03f9b69f4cbec7380e547f146acec7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class LinearSVC @Since("2.2.0") (`
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13909 **[Test build #70537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70537/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r93733483

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -36,29 +31,31 @@
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.SerializableConfiguration

+object JsonFileFormat {
+  def parseJsonOptions(sparkSession: SparkSession, options: Map[String, String]): JSONOptions = {
--- End diff --

I think I disagree with passing the whole `SparkSession`, because apparently we only need `SQLConf` or the value of `spark.sql.columnNameOfCorruptRecord`.
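A narrower helper in the spirit of this comment might look like the following sketch. It illustrates the suggestion only, not the code that was merged; `JSONOptionsLike` is a hypothetical stand-in for the real `JSONOptions`, whose constructor shape is assumed here.

```scala
// Hypothetical stand-in for org.apache.spark.sql.catalyst.json.JSONOptions.
case class JSONOptionsLike(
    parameters: Map[String, String],
    columnNameOfCorruptRecord: String)

object JsonOptionsParsing {
  // Accept only the config value that is actually consumed instead of the
  // whole SparkSession; the caller resolves it once from its SQLConf.
  def parseJsonOptions(
      columnNameOfCorruptRecord: String,
      options: Map[String, String]): JSONOptionsLike =
    JSONOptionsLike(options, columnNameOfCorruptRecord)
}
```

Taking the resolved value rather than the session also makes the helper testable without constructing a `SparkSession`.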
[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15212 Merged build finished. Test PASSed.
[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15212 **[Test build #70536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70536/testReport)** for PR 15212 at commit [`5a7cc2c`](https://github.com/apache/spark/commit/5a7cc2ca9e81ade4d430411ab6e314ae5010169f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15212 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70536/
[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r93732800

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.json
+
+import java.io.InputStream
+
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.io.{LongWritable, Text}
+import org.apache.hadoop.mapreduce.Job
+import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.input.{PortableDataStream, StreamInputFormat}
+import org.apache.spark.rdd.{BinaryFileRDD, RDD}
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.json.{CreateJacksonParser, JacksonParser, JSONOptions}
+import org.apache.spark.sql.execution.datasources.{CodecStreams, HadoopFileLinesReader, PartitionedFile}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Common functions for parsing JSON files
+ * @tparam T A datatype containing the unparsed JSON, such as [[Text]] or [[String]]
+ */
+abstract class JsonDataSource[T] extends Serializable {
+  def isSplitable: Boolean
+
+  /**
+   * Parse a [[PartitionedFile]] into 0 or more [[InternalRow]] instances
+   */
+  def readFile(
+      conf: Configuration,
+      file: PartitionedFile,
+      parser: JacksonParser): Iterator[InternalRow]
+
+  /**
+   * Create an [[RDD]] that handles the preliminary parsing of [[T]] records
+   */
+  protected def createBaseRdd(
+      sparkSession: SparkSession,
+      inputPaths: Seq[FileStatus]): RDD[T]
+
+  /**
+   * A generic wrapper to invoke the correct [[JsonFactory]] method to allocate a [[JsonParser]]
+   * for an instance of [[T]]
+   */
+  def createParser(jsonFactory: JsonFactory, value: T): JsonParser
+
+  final def infer(
+      sparkSession: SparkSession,
+      inputPaths: Seq[FileStatus],
+      parsedOptions: JSONOptions): Option[StructType] = {
+    val jsonSchema = InferSchema.infer(
+      createBaseRdd(sparkSession, inputPaths),
+      parsedOptions,
+      createParser)
+    checkConstraints(jsonSchema)
+
+    if (jsonSchema.fields.nonEmpty) {
--- End diff --

It seems this changes existing behaviour (not allowing empty schema).
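To make the flagged behaviour change concrete: with the `nonEmpty` guard, input whose inferred schema has no fields now yields `None`, whereas the comment reads the previous code as still surfacing the empty schema. A minimal sketch of the two code paths, assuming the old behaviour simply returned whatever was inferred:

```scala
import org.apache.spark.sql.types.StructType

object EmptySchemaBehaviour {
  // The guard from the diff above: an empty inferred schema becomes None.
  def inferWithGuard(jsonSchema: StructType): Option[StructType] =
    if (jsonSchema.fields.nonEmpty) Some(jsonSchema) else None

  // Assumed previous behaviour: the empty schema is still returned.
  def inferWithoutGuard(jsonSchema: StructType): Option[StructType] =
    Some(jsonSchema)

  def main(args: Array[String]): Unit = {
    val empty = StructType(Nil)
    println(inferWithGuard(empty))    // None
    println(inferWithoutGuard(empty)) // Some(...) wrapping the empty schema
  }
}
```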
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15211 **[Test build #70539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70539/testReport)** for PR 15211 at commit [`21ecbf0`](https://github.com/apache/spark/commit/21ecbf08ed03f9b69f4cbec7380e547f146acec7).
[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16368 ah, it was merged https://git-wip-us.apache.org/repos/asf?p=spark.git
[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16368 I kept getting an error with the merge script, so I'm not sure it went through. Are we having a sync issue with GitHub?
[GitHub] spark issue #16312: [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into mult...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16312 ah, thank you @shivaram. Sorry I couldn't get around to investigating earlier. @yanboliang It looks like that is the design of the trait BaseReadWrite ([here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L80)), where it holds references to `sc`, `sqlContext`, and the `spark` session. That said, I see other MLReader/MLWriter implementations call `sc` directly, whereas the design should allow the `sc`/Spark session to be updated. Specifically, we could change these calls to pass the Spark session to the RWrapper, but reusing `sc` is generally the design of BaseReadWrite/MLReader/MLWriter and is not specific to R.
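A simplified model of the caching pattern being described is sketched below; the names are illustrative stand-ins for `BaseReadWrite`/`MLReader` in `org.apache.spark.ml.util.ReadWrite.scala`, not the actual trait.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

trait BaseReadWriteLike {
  private var optionSparkSession: Option[SparkSession] = None

  // Callers may inject a fresh session explicitly...
  def session(sparkSession: SparkSession): this.type = {
    optionSparkSession = Option(sparkSession)
    this
  }

  // ...otherwise one is captured lazily on first use and never refreshed.
  protected final def sparkSession: SparkSession = {
    if (optionSparkSession.isEmpty) {
      optionSparkSession = Some(SparkSession.builder().getOrCreate())
    }
    optionSparkSession.get
  }

  // Derived from the cached session, so after the session is stopped and
  // recreated this can point at a stopped SparkContext: the
  // "Cannot call methods on a stopped SparkContext" failure traced in #16312.
  protected final def sc: SparkContext = sparkSession.sparkContext
}
```

Passing the current session into the wrapper at call time, as suggested, sidesteps the stale capture.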
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16386

> the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure

I am worried about changing the behaviour. I understand why it had to be done here, as you described in the description, but we have the `input_file_name` function for this. I would not expect, at least, file names in `_corrupt_record`. If this is acceptable, we need to document it around `spark.sql.columnNameOfCorruptRecord` in `SQLConf` and around `columnNameOfCorruptRecord` in the readers/writers in Python and Scala.
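For reference, the alternative alluded to keeps the malformed text in the corrupt-record column and recovers the originating file through `input_file_name()`. A minimal sketch, with a hypothetical input path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object CorruptRecordWithFileName {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("corrupt-record-demo").getOrCreate()

    val df = spark.read
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/tmp/input/*.json") // hypothetical path
      .withColumn("source_file", input_file_name())

    // _corrupt_record keeps the literal malformed JSON (the existing
    // behaviour), while source_file carries the path each row came from.
    df.show(truncate = false)
  }
}
```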
[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16368 Hmm, it looks like this is merged but not reflected on GitHub?
[GitHub] spark pull request #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16387#discussion_r93732158

--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
@@ -192,12 +193,16 @@ class ExternalAppendOnlyMap[K, V, C](
    * It will be called by TaskMemoryManager when there is not enough memory for the task.
    */
   override protected[this] def forceSpill(): Boolean = {
-    assert(readingIterator != null)
-    val isSpilled = readingIterator.spill()
-    if (isSpilled) {
-      currentMap = null
+    if (isReadingIterator) {
+      assert(readingIterator != null)
+      val isSpilled = readingIterator.spill()
+      if (isSpilled) {
+        currentMap = null
+      }
+      isSpilled
+    } else {
+      false
--- End diff --

I chose to simply return false for now. Another option is to actually spill the in-memory map.
[GitHub] spark issue #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn't fail...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16387 **[Test build #70538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70538/testReport)** for PR 16387 at commit [`03d4dc0`](https://github.com/apache/spark/commit/03d4dc0afbba0217a322a20a999894640f43aecc).
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r93732043

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -56,33 +58,100 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
   }

   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    val arrayClass = classOf[GenericArrayData].getName
-    val values = ctx.freshName("values")
-    ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-    ev.copy(code = s"""
-      this.$values = new Object[${children.size}];""" +
-      ctx.splitExpressions(
-        ctx.INPUT_ROW,
-        children.zipWithIndex.map { case (e, i) =>
-          val eval = e.genCode(ctx)
-          eval.code + s"""
-            if (${eval.isNull}) {
-              $values[$i] = null;
-            } else {
-              $values[$i] = ${eval.value};
-            }
-          """
-        }) +
-      s"""
-        final ArrayData ${ev.value} = new $arrayClass($values);
-        this.$values = null;
-      """, isNull = "false")
+    val array = ctx.freshName("array")
+
+    val et = dataType.elementType
+    val evals = children.map(e => e.genCode(ctx))
+    val isPrimitiveArray = ctx.isPrimitiveType(et)
+    val primitiveTypeName = if (isPrimitiveArray) ctx.primitiveTypeName(et) else ""
+    val (preprocess, arrayData, arrayWriter) =
+      GenArrayData.getCodeArrayData(ctx, et, children.size, isPrimitiveArray, array)
+
+    val assigns = if (isPrimitiveArray) {
+      evals.zipWithIndex.map { case (eval, i) =>
+        eval.code + s"""
+          if (${eval.isNull}) {
+            $arrayWriter.setNull$primitiveTypeName($i);
+          } else {
+            $arrayWriter.write($i, ${eval.value});
+          }
+        """
+      }
+    } else {
+      evals.zipWithIndex.map { case (eval, i) =>
+        eval.code + s"""
+          if (${eval.isNull}) {
+            $array[$i] = null;
+          } else {
+            $array[$i] = ${eval.value};
+          }
+        """
+      }
+    }
+    ev.copy(code =
+      preprocess +
+      ctx.splitExpressions(ctx.INPUT_ROW, assigns) +
+      s"\nfinal ArrayData ${ev.value} = $arrayData;\n",
+      isNull = "false")
   }

   override def prettyName: String = "array"
 }

+private [sql] object GenArrayData {
+  // This function returns Java code pieces based on DataType and isPrimitive
+  // for allocation of ArrayData class
+  def getCodeArrayData(
+      ctx: CodegenContext,
+      dt: DataType,
+      size: Int,
+      isPrimitive: Boolean,
+      array: String): (String, String, String) = {
+    if (!isPrimitive) {
+      val arrayClass = classOf[GenericArrayData].getName
+      ctx.addMutableState("Object[]", array,
+        s"this.$array = new Object[${size}];")
+      ("", s"new $arrayClass($array)", null)
+    } else {
+      val row = ctx.freshName("row")
+      val holder = ctx.freshName("holder")
+      val rowWriter = ctx.freshName("createRowWriter")
+      val arrayWriter = ctx.freshName("createArrayWriter")
+      val unsafeRowClass = classOf[UnsafeRow].getName
+      val unsafeArrayClass = classOf[UnsafeArrayData].getName
+      val holderClass = classOf[BufferHolder].getName
+      val rowWriterClass = classOf[UnsafeRowWriter].getName
+      val arrayWriterClass = classOf[UnsafeArrayWriter].getName
+      ctx.addMutableState(unsafeRowClass, row, "")
+      ctx.addMutableState(unsafeArrayClass, array, "")
+      ctx.addMutableState(holderClass, holder, "")
+      ctx.addMutableState(rowWriterClass, rowWriter, "")
+      ctx.addMutableState(arrayWriterClass, arrayWriter, "")
+      val unsafeArraySizeInBytes =
+        UnsafeArrayData.calculateHeaderPortionInBytes(size) +
+        ByteArrayMethods.roundNumberOfBytesToNearestWord(dt.defaultSize * size)
+
+      // To write data to UnsafeArrayData, we create UnsafeRow with a single array field
+      // and then prepare BufferHolder for the array.
+      // In summary, this does not use UnsafeRow and wastes some bits in a byte array
+      (s"""
+        $row = new $unsafeRowClass(1);
+        $holder = new $holderClass($row, $unsafeArraySizeInBytes);
+        $rowWriter = new $rowWriterClass($holder, 1);
--- End diff --

For now, let me make `ArrayData` mutable. If we have an agreement to make `UnsafeArrayData` mutable, I can do it.
[GitHub] spark pull request #16387: [SPARK-18986][Core] ExternalAppendOnlyMap shouldn...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16387 [SPARK-18986][Core] ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

## What changes were proposed in this pull request?

`ExternalAppendOnlyMap.forceSpill` now uses an assert to check that the iterator in the map is not null. However, the assertion only holds after the map has been asked for its iterator. Before that, if another memory consumer asks for more memory than is currently available, `ExternalAppendOnlyMap.forceSpill` is also called. In this case, we will see a failure like this:

[info] java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:156)
[info]   at org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
[info]   at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
[info]   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 fix-externalappendonlymap

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16387.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16387

commit 2e4f34e54e92bfc47d817cb6392d89d660401b57
Author: Liang-Chi Hsieh
Date: 2016-12-23T04:59:01Z
Return false when forceSpill is called before the map is asked for an iterator.

commit 03d4dc0afbba0217a322a20a999894640f43aecc
Author: Liang-Chi Hsieh
Date: 2016-12-23T05:45:54Z
Add test.
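The shape of the fix (see the `forceSpill` diff earlier in this thread) can be modelled as a guard that makes force-spill requests a no-op until the map has handed out its destructive iterator. The stand-alone sketch below is an illustrative model only: `SpillableLike` stands in for `Spillable`/`ExternalAppendOnlyMap`, and "spilling" is reduced to clearing the map.

```scala
class SpillableLike[K, V] {
  private var currentMap: Map[K, V] = Map.empty
  private var readingIterator: Iterator[(K, V)] = _
  private var isReadingIterator = false

  def insert(k: K, v: V): Unit = currentMap += (k -> v)

  // Once this is called, the map's contents live in the iterator.
  def destructiveIterator(): Iterator[(K, V)] = {
    readingIterator = currentMap.iterator
    isReadingIterator = true
    readingIterator
  }

  // Before the fix this asserted readingIterator != null unconditionally, so
  // a memory request from another consumer before iteration failed the task.
  def forceSpill(): Boolean = {
    if (isReadingIterator) {
      assert(readingIterator != null)
      currentMap = Map.empty // stand-in for spilling to disk
      true
    } else {
      false // nothing spilled yet; report that no memory was released
    }
  }
}
```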
[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16368 Merging this into master, branch-2.0
[GitHub] spark issue #16312: [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into mult...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16312 I looked at this more closely and I think I found the problem, though I'm not sure it's easy to fix. What I traced here is:
- When we call sparkR.session.stop and sparkR.session, the same JVM backend is reused and only the SparkContext is stopped / recreated.
- The problem happens when we call `read.ml` to read a model after creating a new SparkSession. This in turn calls into RWrappers [1], which has an `sc` member variable.
- My understanding is that the `sc` member variable is bound the first time we create a SparkSession, so when we stop and restart, it holds a handle to the stale SparkContext.
- Thus we see errors saying `Cannot call methods on a stopped SparkContext`.

I think the right fix here is to pass a SparkContext into RWrappers and not rely on a prior initialization. However, I'm not sure why that design decision was made before, so maybe I'm missing something.

[1] https://github.com/apache/spark/blob/f252cb5d161e064d39cc1ed1d9299307a0636174/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L36
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15996 ah https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98 failed.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13909 **[Test build #70537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70537/testReport)** for PR 13909 at commit [`0af0828`](https://github.com/apache/spark/commit/0af08282f4f1d72d205442ba66d6964cd1ac0599).
[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r93731259

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -155,21 +155,24 @@ def load(self, path=None, format=None, schema=None, **options):
         return self._df(self._jreader.load())

     @since(1.4)
-    def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
+    def json(self, path, schema=None, wholeFile=None, primitivesAsString=None, prefersDecimal=None,
--- End diff --

we need to add this to the end; otherwise it breaks compatibility for positional arguments.
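The hazard is the same in any language with positional arguments; a small Scala sketch with made-up signatures shows the silent re-binding:

```scala
object PositionalCompat {
  // v1: the existing signature.
  def jsonV1(path: String, primitivesAsString: Boolean = false): Unit =
    println(s"path=$path primitivesAsString=$primitivesAsString")

  // Inserting wholeFile before primitivesAsString keeps keyword calls working
  // but silently re-binds existing positional calls.
  def jsonV2(path: String, wholeFile: Boolean = false,
      primitivesAsString: Boolean = false): Unit =
    println(s"path=$path wholeFile=$wholeFile primitivesAsString=$primitivesAsString")

  def main(args: Array[String]): Unit = {
    jsonV1("a.json", true) // caller meant primitivesAsString = true
    jsonV2("a.json", true) // the same call now sets wholeFile = true instead
  }
}
```

Appending the new parameter after the existing ones avoids the re-binding, which is what the review asks for.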
[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15211#discussion_r93731229

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,525 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+import scala.collection.mutable
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[ml] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter
+  with HasFitIntercept with HasTol with HasStandardization with HasWeightCol with HasThreshold
+  with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0")(
+    @Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E-4.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+
+  @Since("2.2.0")
+  override def copy(extra: ParamMap): LinearSVC = defaultCopy(extra)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  setDefault(maxIter -> 100,
+    regParam -> 0.0,
+    threshold -> 0,
+    tol -> 1E-6,
+    fitIntercept -> true
+  )
+
+  /**
+   * Train a linear SVM Classifier Model with Hinge Loss and OWLQN optimizer
+   *
+   * @param dataset Training dataset
+   * @return Fitted model
+   */
+  override protected def train(dataset: Dataset[_]): LinearSVCModel = {
+    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+    val instances: RDD[Instance] =
+      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map
[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15212 **[Test build #70536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70536/testReport)** for PR 15212 at commit [`5a7cc2c`](https://github.com/apache/spark/commit/5a7cc2ca9e81ade4d430411ab6e314ae5010169f).
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15996 LGTM. Can you update the comment to address my last comment (https://github.com/apache/spark/pull/15996#discussion_r93730700)?
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93730700

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -643,6 +644,14 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     withTable("t") {
       val provider = "org.apache.spark.sql.test.DefaultSource"
       sql(s"CREATE TABLE t USING $provider")
+
+      // make sure the data source doesn't provide `InsertableRelation`, so that we can only append
+      // data to it with `CreatableRelationProvider.createRelation`
--- End diff --

One last comment. Let's explicitly say that we want to test the case where a data source is a `CreatableRelationProvider` but its relation does not implement `InsertableRelation`.
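For readers following along, such a source has roughly the shape sketched below: the provider can create a relation from a DataFrame, but the relation exposes no insert path, so appends must round-trip through `createRelation`. This is an illustrative shape, not the actual `org.apache.spark.sql.test.DefaultSource`.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class AppendOnlyProvider extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // Persist `data` here, then hand back a relation that is deliberately
    // NOT an InsertableRelation: the only way to append more data is
    // another createRelation call.
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = StructType(StructField("a", LongType) :: Nil)
    }
  }
}
```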
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16337 **[Test build #70535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70535/testReport)** for PR 16337 at commit [`1c1900a`](https://github.com/apache/spark/commit/1c1900a261b12e95a8a53892017294df3c21b317).
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16337 Retest this please.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r93730314

--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -164,6 +164,27 @@ private[spark] object TestUtils {
     createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }

+  /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: String): String = {
--- End diff --

The R tests do indeed verify that they can call the internal functions. I can revert that part of the changes.
[GitHub] spark pull request #16384: [BUILD] make-distribution support alternate pytho...
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/16384
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r93729928

--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -164,6 +164,27 @@ private[spark] object TestUtils {
     createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }

+  /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: String): String = {
--- End diff --

Yeah, when I wrote this that didn't exist yet. Changing.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15666 **[Test build #70534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70534/testReport)** for PR 15666 at commit [`73df5a4`](https://github.com/apache/spark/commit/73df5a4f5a961e558588b3462e7744a1c9c1266a).
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Merged build finished. Test FAILed.
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16386 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70531/
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #70531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70531/testReport)** for PR 16386 at commit [`7ad5d5b`](https://github.com/apache/spark/commit/7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70532/
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15996 Merged build finished. Test FAILed.
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70532/testReport)** for PR 15996 at commit [`9a1ad71`](https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user kevinyu98 commented on the issue: https://github.com/apache/spark/pull/16337 I just ran `build/sbt "test-only org.apache.spark.sql.streaming.StreamSuite"` on my local machine, as well as the whole SQL suite, and everything passed. Can you re-run the test? Thanks
[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16323#discussion_r93726972 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -41,13 +41,13 @@ import org.apache.spark.sql.types._ * @param sizeInBytes Physical size in bytes. For leaf operators this defaults to 1, otherwise it *defaults to the product of children's `sizeInBytes`. * @param rowCount Estimated number of rows. - * @param colStats Column-level statistics. + * @param attributeStats Statistics for Attributes. * @param isBroadcastable If true, output is small enough to be used in a broadcast join. */ case class Statistics( sizeInBytes: BigInt, rowCount: Option[BigInt] = None, -colStats: Map[String, ColumnStat] = Map.empty, +attributeStats: AttributeMap[ColumnStat] = AttributeMap(Nil), --- End diff -- Will we estimate statistics for all attributes in a logical plan? I mean, if an attribute comes not from a leaf node but from a later plan node such as `Join`, do we still have a `ColumnStat` for it? If not, I don't think we need to rename this parameter to `attributeStats`; the original `colStats` would suffice.
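A minimal sketch of why the rename matters, assuming spark-catalyst is on the classpath: `AttributeMap` resolves entries by the attribute's `exprId` rather than its name, so attributes that share a name but come from different plan nodes do not collide.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
import org.apache.spark.sql.types.IntegerType

// Two attributes with the same name but distinct exprIds, as would come from
// different plan nodes.
val a1 = AttributeReference("id", IntegerType)()
val a2 = AttributeReference("id", IntegerType)()

val stats = AttributeMap(Seq(a1 -> "column stats for a1"))
assert(stats.contains(a1))  // found by exprId
assert(!stats.contains(a2)) // same name, different exprId: no match
```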
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16228 **[Test build #70533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70533/testReport)** for PR 16228 at commit [`c3e3a48`](https://github.com/apache/spark/commit/c3e3a48c930c9f00bf77a11dfe0ef819ca005b26). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LeftSemiAntiEstimation(join: Join) `
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16228 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70533/ Test FAILed.
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16228 Merged build finished. Test FAILed.
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16228 **[Test build #70533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70533/testReport)** for PR 16228 at commit [`c3e3a48`](https://github.com/apache/spark/commit/c3e3a48c930c9f00bf77a11dfe0ef819ca005b26).
[GitHub] spark pull request #16323: [SPARK-18911] [SQL] Define CatalogStatistics to i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16323#discussion_r93726768 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -237,6 +239,38 @@ case class CatalogTable( } +/** + * This class of statistics is used in [[CatalogTable]] to interact with metastore. --- End diff -- Can you add a few words explaining why we don't use `Statistics` for `CatalogTable`?
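A hypothetical sketch of the separation being asked about (the shape mirrors the PR's naming, but the exact API and imports here are assumptions): catalog-side statistics can only key column stats by name, because resolved attributes exist only after analysis, while plan-side `Statistics` is keyed by attributes once the relation's output is known.

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics}

// Hypothetical catalog-side holder; not necessarily the PR's exact signature.
case class CatalogStatisticsSketch(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colStats: Map[String, ColumnStat] = Map.empty) {

  // Convert name-keyed metastore stats into attribute-keyed plan stats once
  // the relation's resolved output attributes are available.
  def toPlanStats(output: Seq[Attribute]): Statistics = {
    val attrStats = output.flatMap(a => colStats.get(a.name).map(a -> _))
    Statistics(sizeInBytes, rowCount, AttributeMap(attrStats))
  }
}
```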
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r93726522 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -56,33 +58,100 @@ case class CreateArray(children: Seq[Expression]) extends Expression { } override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { -val arrayClass = classOf[GenericArrayData].getName -val values = ctx.freshName("values") -ctx.addMutableState("Object[]", values, s"this.$values = null;") - -ev.copy(code = s""" - this.$values = new Object[${children.size}];""" + - ctx.splitExpressions( -ctx.INPUT_ROW, -children.zipWithIndex.map { case (e, i) => - val eval = e.genCode(ctx) - eval.code + s""" -if (${eval.isNull}) { - $values[$i] = null; -} else { - $values[$i] = ${eval.value}; -} - """ -}) + - s""" -final ArrayData ${ev.value} = new $arrayClass($values); -this.$values = null; - """, isNull = "false") +val array = ctx.freshName("array") + +val et = dataType.elementType +val evals = children.map(e => e.genCode(ctx)) +val isPrimitiveArray = ctx.isPrimitiveType(et) +val primitiveTypeName = if (isPrimitiveArray) ctx.primitiveTypeName(et) else "" +val (preprocess, arrayData, arrayWriter) = + GenArrayData.getCodeArrayData(ctx, et, children.size, isPrimitiveArray, array) + +val assigns = if (isPrimitiveArray) { + evals.zipWithIndex.map { case (eval, i) => +eval.code + s""" + if (${eval.isNull}) { + $arrayWriter.setNull$primitiveTypeName($i); + } else { + $arrayWriter.write($i, ${eval.value}); + } + """ + } +} else { + evals.zipWithIndex.map { case (eval, i) => +eval.code + s""" + if (${eval.isNull}) { + $array[$i] = null; + } else { + $array[$i] = ${eval.value}; + } + """ + } +} +ev.copy(code = + preprocess + + ctx.splitExpressions(ctx.INPUT_ROW, assigns) + + s"\nfinal ArrayData ${ev.value} = $arrayData;\n", + isNull = "false") } override def prettyName: String = "array" } +private [sql] object GenArrayData { + // This function returns Java code pieces based on DataType and isPrimitive + // for allocation of ArrayData class + def getCodeArrayData( + ctx: CodegenContext, + dt: DataType, + size: Int, + isPrimitive : Boolean, + array: String): (String, String, String) = { +if (!isPrimitive) { + val arrayClass = classOf[GenericArrayData].getName + ctx.addMutableState("Object[]", array, +s"this.$array = new Object[${size}];") + ("", s"new $arrayClass($array)", null) +} else { + val row = ctx.freshName("row") + val holder = ctx.freshName("holder") + val rowWriter = ctx.freshName("createRowWriter") + val arrayWriter = ctx.freshName("createArrayWriter") + val unsafeRowClass = classOf[UnsafeRow].getName + val unsafeArrayClass = classOf[UnsafeArrayData].getName + val holderClass = classOf[BufferHolder].getName + val rowWriterClass = classOf[UnsafeRowWriter].getName + val arrayWriterClass = classOf[UnsafeArrayWriter].getName + ctx.addMutableState(unsafeRowClass, row, "") + ctx.addMutableState(unsafeArrayClass, array, "") + ctx.addMutableState(holderClass, holder, "") + ctx.addMutableState(rowWriterClass, rowWriter, "") + ctx.addMutableState(arrayWriterClass, arrayWriter, "") + val unsafeArraySizeInBytes = +UnsafeArrayData.calculateHeaderPortionInBytes(size) + +ByteArrayMethods.roundNumberOfBytesToNearestWord(dt.defaultSize * size) + + // To write data to UnsafeArrayData, we create UnsafeRow with a single array field + // and then prepare BufferHolder for the array. 
+ // In summary, this does not use UnsafeRow and wastes some bits in an byte array + (s""" +$row = new $unsafeRowClass(1); +$holder = new $holderClass($row, $unsafeArraySizeInBytes); +$rowWriter = new $rowWriterClass($holder, 1); --- End diff -- About the scope of this change: do we need to make `ArrayData` mutable, or just `UnsafeArrayData`? Actually, I don't see that `GenericArrayData` needs this mutability now.
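For contrast with the `UnsafeArrayData` path in the diff above, a tiny sketch of the boxed path kept for non-primitive element types, assuming spark-catalyst is on the classpath:

```scala
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Non-primitive path: stage element values (nulls allowed) in an Object[]
// and wrap the array without copying.
val values = new Array[Any](3)
values(0) = 1
values(1) = null
values(2) = 3
val arrayData = new GenericArrayData(values)
assert(arrayData.numElements() == 3 && arrayData.isNullAt(1))
```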
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726073 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,11 +171,14 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. + * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + * `fdr`, `fwe`. * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. * - `percentile` is similar but chooses a fraction of all features instead of a fixed number. * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false *positive rate of selection. + * - `fdr` chooses all features whose false discovery rate meets some threshold. --- End diff -- Ditto
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726194 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -255,19 +288,22 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { private[spark] object ChiSqSelector { - /** - * String name for `numTopFeatures` selector type. - */ + /** String name for `numTopFeatures` selector type. */ val NumTopFeatures: String = "numTopFeatures" --- End diff -- ```private[spark]```
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93725579 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params def getFpr: Double = $(fpr) /** + * The highest uncorrected p-value for features to be kept. + * Only applicable when selectorType = "fdr". + * Default value is 0.05. + * @group param + */ + @Since("2.1.0") --- End diff -- Update version to 2.2.0.
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726048 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -111,11 +139,14 @@ private[feature] trait ChiSqSelectorParams extends Params /** * Chi-Squared feature selection, which selects categorical features to use for predicting a * categorical label. - * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. + * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + * `fdr`, `fwe`. * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. * - `percentile` is similar but chooses a fraction of all features instead of a fixed number. * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false *positive rate of selection. + * - `fdr` chooses all features whose false discovery rate meets some threshold. + * - `fwe` chooses all features whose family-wise error rate meets some threshold. --- End diff -- Update according to the above suggestion.
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93725408 --- Diff: docs/mllib-feature-extraction.md --- @@ -227,11 +227,13 @@ both speed and statistical learning behavior. [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. +* `fdr` chooses all features whose false discovery rate meets some threshold. +* `fwe` chooses all features whose family-wise error rate meets some threshold. --- End diff -- Update according to the above suggestion.
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726001 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params def getFpr: Double = $(fpr) /** + * The highest uncorrected p-value for features to be kept. + * Only applicable when selectorType = "fdr". + * Default value is 0.05. + * @group param + */ + @Since("2.1.0") + final val fdr = new DoubleParam(this, "fdr", +"The highest uncorrected p-value for features to be kept.", ParamValidators.inRange(0, 1)) + setDefault(fdr -> 0.05) + + /** @group getParam */ + def getFdr: Double = $(fdr) + + /** + * The highest uncorrected p-value for features to be kept. --- End diff -- Ditto, ```The upper bound of the expected family-wise error rate```.
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93725173 --- Diff: docs/ml-features.md --- @@ -1423,12 +1423,12 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: - +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. - +* `fdr` chooses all features whose false discovery rate meets some threshold. +* `fwe` chooses all features whose family-wise error rate meets some threshold. --- End diff -- ```whose p-value is below a threshold, thus controlling the family-wise error rate of selection```
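A standalone sketch of the rule the suggested wording describes (illustrative only, not the PR's code): a Bonferroni-style cut that keeps features whose p-value is below `alpha / numFeatures`, which bounds the family-wise error rate at `alpha`.

```scala
// Keep feature i iff pValues(i) < alpha / m, where m is the number of features.
def selectByFwe(pValues: Seq[Double], alpha: Double): Seq[Int] =
  pValues.zipWithIndex.collect {
    case (p, i) if p < alpha / pValues.length => i
  }

// selectByFwe(Seq(0.01, 0.02, 0.20), alpha = 0.06) keeps only feature 0,
// since the per-feature threshold is 0.06 / 3 = 0.02.
```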
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726320 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala --- @@ -27,61 +27,240 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext { /* * Contingency tables - * feature0 = {8.0, 0.0} + * feature0 = {6.0, 0.0, 8.0} * class 0 1 2 - *8.0||1|0|1| - *0.0||0|2|0| + *6.0||1|0|0| + *0.0||0|3|0| + *8.0||0|0|2| + * degree of freedom = 4, statistic = 12, pValue = 0.017 * * feature1 = {7.0, 9.0} * class 0 1 2 *7.0||1|0|0| - *9.0||0|2|1| + *9.0||0|3|2| + * degree of freedom = 2, statistic = 6, pValue = 0.049 * - * feature2 = {0.0, 6.0, 8.0, 5.0} + * feature2 = {0.0, 6.0, 3.0, 8.0} * class 0 1 2 *0.0||1|0|0| - *6.0||0|1|0| + *6.0||0|1|2| + *3.0||0|1|0| *8.0||0|1|0| - *5.0||0|0|1| + * degree of freedom = 6, statistic = 8.66, pValue = 0.193 + * + * feature3 = {7.0, 0.0, 5.0, 4.0} + * class 0 1 2 + *7.0||1|0|0| + *0.0||0|2|0| + *5.0||0|1|1| + *4.0||0|0|1| + * degree of freedom = 6, statistic = 9.5, pValue = 0.147 + * + * feature4 = {6.0, 5.0, 4.0, 0.0} + * class 0 1 2 + *6.0||1|1|0| + *5.0||0|2|0| + *4.0||0|0|1| + *0.0||0|0|1| + * degree of freedom = 6, statistic = 8.0, pValue = 0.238 + * + * feature5 = {0.0, 9.0, 5.0, 4.0} + * class 0 1 2 + *0.0||1|0|1| + *9.0||0|1|0| + *5.0||0|1|0| + *4.0||0|1|1| + * degree of freedom = 6, statistic = 5, pValue = 0.54 * * Use chi-squared calculator from Internet */ - test("ChiSqSelector transform test (sparse & dense vector)") { + test("ChiSqSelector transform by KBest test (sparse & dense vector)") { val labeledDiscreteData = sc.parallelize( --- End diff -- Many test functions need ```labeledDiscreteData```; we can refactor it out of the individual test functions and have the other tests share the same dataset instance.
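A sketch of the suggested refactor, with placeholder rows rather than the suite's real data: hoist the dataset to a lazy field in the suite body so every test shares one instance, while `sc` (set up by `MLlibTestSparkContext`) is still initialized first.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// In the suite body instead of inside each test; lazy so it is built only
// after beforeAll() has created `sc`.
private lazy val labeledDiscreteData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(6.0, 7.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 6.0)),
  LabeledPoint(2.0, Vectors.dense(8.0, 9.0, 6.0))
), 2)
```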
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726203 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -255,19 +288,22 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { private[spark] object ChiSqSelector { - /** - * String name for `numTopFeatures` selector type. - */ + /** String name for `numTopFeatures` selector type. */ val NumTopFeatures: String = "numTopFeatures" - /** - * String name for `percentile` selector type. - */ + /** String name for `percentile` selector type. */ val Percentile: String = "percentile" --- End diff -- ```private[spark]```
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93726092 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -245,6 +264,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < fpr } + case ChiSqSelector.FDR => +// This uses the Benjamini-Hochberg procedure. --- End diff -- Add a link to explain the ```Benjamini-Hochberg procedure```: https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
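For readers following the link, a minimal standalone sketch of the Benjamini-Hochberg step (illustrative only, not the PR's implementation): sort the p-values ascending, find the largest rank k with p(k) <= (k/m) * alpha, and keep every feature up to and including that rank.

```scala
def benjaminiHochberg(pValues: Seq[Double], alpha: Double): Set[Int] = {
  val m = pValues.length
  // (pValue, original feature index), sorted by p-value ascending
  val sorted = pValues.zipWithIndex.sortBy(_._1)
  // largest zero-based rank whose p-value clears its own BH threshold
  val maxRank = sorted.zipWithIndex.collect {
    case ((p, _), rank) if p <= (rank + 1).toDouble / m * alpha => rank
  }.lastOption
  // everything up to that rank is kept, even if an intermediate rank failed
  maxRank.fold(Set.empty[Int])(r => sorted.take(r + 1).map(_._2).toSet)
}

// benjaminiHochberg(Seq(0.01, 0.02, 0.20), alpha = 0.05) selects features 0 and 1:
// the thresholds are 0.0167, 0.0333, 0.05 and the largest passing rank is the second.
```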
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93725098 --- Diff: docs/ml-features.md --- @@ -1423,12 +1423,12 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: - +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. - +* `fdr` chooses all features whose false discovery rate meets some threshold. --- End diff -- Would ``` `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold``` be better?
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML]add feature selector meth...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r93725546 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params def getFpr: Double = $(fpr) /** + * The highest uncorrected p-value for features to be kept. --- End diff -- I think the doc is incorrect even though it's consistent with sklearn; we don't actually compare the ```fdr``` value with a ```p-value``` directly. I'd prefer to change it to ```The upper bound of the expected false discovery rate```, which is more accurate and easier to understand.
[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16291 I agree with @markhamstra and @vanzin - having the ability to tag listeners into groups (default = spark listener group) and preserving the current synchronized behavior within a group would ensure backward compatibility at fairly minimal additional complexity.
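A rough sketch of the idea under discussion, with hypothetical names (this is not Spark's listener-bus API): listeners carry a group tag, and delivery stays synchronized within each group, so existing listeners keep their current ordering guarantees while groups are isolated from one another.

```scala
import scala.collection.mutable

trait SketchListener { def onEvent(event: Any): Unit }

class GroupedListenerBus {
  private val groups = mutable.Map.empty[String, mutable.ArrayBuffer[SketchListener]]

  def register(listener: SketchListener, group: String = "spark"): Unit = synchronized {
    groups.getOrElseUpdate(group, mutable.ArrayBuffer.empty[SketchListener]) += listener
  }

  def post(event: Any): Unit = {
    val snapshot = synchronized(groups.values.toList)
    snapshot.foreach { listeners =>
      // One lock per group: ordering is preserved within a group, while a slow
      // group does not hold up another (each group could be drained by its own
      // executor, elided here).
      listeners.synchronized(listeners.foreach(_.onEvent(event)))
    }
  }
}
```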
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Hello recent JacksonGenerator.scala committers, please take a look. cc/ @rxin @hvanhovell @clockfly @hyukjinkwon @cloud-fan
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15996 **[Test build #70532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70532/testReport)** for PR 15996 at commit [`9a1ad71`](https://github.com/apache/spark/commit/9a1ad71db44558bb6eb380dc23a1a1abbc2f3e98).
[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16386 **[Test build #70531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70531/testReport)** for PR 16386 at commit [`7ad5d5b`](https://github.com/apache/spark/commit/7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695).
[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/16386 [SPARK-18352][SQL] Support parsing multiline json files ## What changes were proposed in this pull request? If a new option `wholeFile` is set to `true`, the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory. Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired. I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one. ## How was this patch tested? New and existing unit tests. No performance or load tests have been run. You can merge this pull request into a Git repository by running: $ git pull https://github.com/NathanHowell/spark SPARK-18352 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16386.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16386 commit 740620210b30ef02e280d161d6b08088d07300fa Author: Nathan Howell Date: 2016-12-22T22:16:49Z [SPARK-18352][SQL] Support parsing multiline json files commit 7902255a79fc2581214a09ccd38437cebd19d862 Author: Nathan Howell Date: 2016-12-22T00:27:19Z JacksonParser.parseJsonToken should be explicit about nulls and boxing commit 149418647c9831e88af866d44d31496940c02162 Author: Nathan Howell Date: 2016-12-21T23:49:37Z Increase type safety of makeRootConverter, remove runtime type tests commit 7ad5d5be0c7b41112f9f6ad3cb0cf9055de62695 Author: Nathan Howell Date: 2016-12-23T02:13:59Z Field converter lookups should be O(1)
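A hedged usage sketch of the option described above, assuming a `SparkSession` named `spark` and a path of your choosing (the option name follows this PR and may differ in released versions):

```scala
// Each matched file is parsed as a single JSON value instead of one JSON
// record per line.
val df = spark.read
  .option("wholeFile", "true")
  .json("/path/to/multiline-json-dir")
df.show()
```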
[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16383#discussion_r93725196 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -143,15 +197,96 @@ case class TypedAggregateExpression( } } - override def toString: String = { -val input = inputDeserializer match { - case Some(UnresolvedDeserializer(deserializer, _)) => deserializer.dataType.simpleString - case Some(deserializer) => deserializer.dataType.simpleString - case _ => "unknown" + override def withInputInfo( + deser: Expression, + cls: Class[_], + schema: StructType): TypedAggregateExpression = { +copy(inputDeserializer = Some(deser), inputClass = Some(cls), inputSchema = Some(schema)) + } +} + +case class ComplexTypedAggregateExpression( +aggregator: Aggregator[Any, Any, Any], +inputDeserializer: Option[Expression], +inputClass: Option[Class[_]], +inputSchema: Option[StructType], +bufferSerializer: Seq[NamedExpression], +bufferDeserializer: Expression, +outputSerializer: Seq[Expression], +dataType: DataType, +nullable: Boolean, +mutableAggBufferOffset: Int = 0, +inputAggBufferOffset: Int = 0) + extends TypedImperativeAggregate[Any] with TypedAggregateExpression with NonSQLExpression { + + override def deterministic: Boolean = true + + override def children: Seq[Expression] = inputDeserializer.toSeq + + override lazy val resolved: Boolean = inputDeserializer.isDefined && childrenResolved + + override def references: AttributeSet = AttributeSet(inputDeserializer.toSeq) + + override def createAggregationBuffer(): Any = aggregator.zero + + private lazy val inputRowToObj = GenerateSafeProjection.generate(inputDeserializer.get :: Nil) + + override def update(buffer: Any, input: InternalRow): Any = { +val inputObj = inputRowToObj(input).get(0, ObjectType(classOf[Any])) +if (inputObj != null) { + aggregator.reduce(buffer, inputObj) +} else { + buffer +} + } + + override def merge(buffer: Any, input: Any): Any = { +aggregator.merge(buffer, input) + } + + private lazy val resultObjToRow = dataType match { +case _: StructType => + UnsafeProjection.create(CreateStruct(outputSerializer)) +case _ => + assert(outputSerializer.length == 1) + UnsafeProjection.create(outputSerializer.head) + } + + override def eval(buffer: Any): Any = { +val resultObj = aggregator.finish(buffer) +if (resultObj == null) { + null +} else { + resultObjToRow(InternalRow(resultObj)).get(0, dataType) } + } -s"$nodeName($input)" + private lazy val bufferObjToRow = UnsafeProjection.create(bufferSerializer) + + override def serialize(buffer: Any): Array[Byte] = { +bufferObjToRow(InternalRow(buffer)).getBytes } - override def nodeName: String = aggregator.getClass.getSimpleName.stripSuffix("$") + private lazy val bufferRow = new UnsafeRow(bufferSerializer.length) + private lazy val bufferRowToObject = GenerateSafeProjection.generate(bufferDeserializer :: Nil) + + override def deserialize(storageFormat: Array[Byte]): Any = { +bufferRow.pointTo(storageFormat, storageFormat.length) +bufferRowToObject(bufferRow).get(0, ObjectType(classOf[Any])) + } + + override def withNewMutableAggBufferOffset( + newMutableAggBufferOffset: Int): ComplexTypedAggregateExpression = +copy(mutableAggBufferOffset = newMutableAggBufferOffset) + + override def withNewInputAggBufferOffset( + newInputAggBufferOffset: Int): ComplexTypedAggregateExpression = +copy(inputAggBufferOffset = newInputAggBufferOffset) + + override def withInputInfo( + deser: Expression, + cls: Class[_], + schema: 
StructType): TypedAggregateExpression = { +copy(inputDeserializer = Some(deser), inputClass = Some(cls), inputSchema = Some(schema)) --- End diff -- Where do we need to use `inputClass`? `TypedAggregateExpression` has this parameter, but I don't see it used anywhere.
[GitHub] spark issue #16119: [SPARK-18687][Pyspark][SQL]Backward compatibility - crea...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16119 @vijoshi do you mind updating your PR according to the discussion? i.e., simplify the fix and the test
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user lirui-intel commented on the issue: https://github.com/apache/spark/pull/12775 I'm not sure whether my patch makes the tests unstable, but I can't figure out why it would. @kayousterhout @mridulm any ideas?
[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16383#discussion_r93724428 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -505,19 +511,18 @@ abstract class TypedImperativeAggregate[T] extends ImperativeAggregate { def deserialize(storageFormat: Array[Byte]): T final override def initialize(buffer: InternalRow): Unit = { -val bufferObject = createAggregationBuffer() -buffer.update(mutableAggBufferOffset, bufferObject) +buffer(mutableAggBufferOffset) = createAggregationBuffer() } final override def update(buffer: InternalRow, input: InternalRow): Unit = { -update(getBufferObject(buffer), input) +buffer(mutableAggBufferOffset) = update(getBufferObject(buffer), input) --- End diff -- I can't find where `InternalRow` implements `apply(int)`; is there an implicit cast here?
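One likely answer, as a sketch: no implicit is needed, because Scala desugars assignment through apply syntax to a call to `update`, which `InternalRow` declares.

```scala
// `x(i) = v` is Scala sugar for `x.update(i, v)`; no implicit conversion is involved.
class Cell {
  private var value: Any = _
  def update(i: Int, v: Any): Unit = { require(i == 0); value = v }
  def apply(i: Int): Any = { require(i == 0); value }
}

val c = new Cell
c(0) = "hello"          // compiles to c.update(0, "hello")
assert(c(0) == "hello") // compiles to c.apply(0)
```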
[GitHub] spark pull request #16383: [SPARK-18980][SQL] implement Aggregator with Type...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16383#discussion_r93724370 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -471,23 +471,29 @@ abstract class TypedImperativeAggregate[T] extends ImperativeAggregate { def createAggregationBuffer(): T /** - * In-place updates the aggregation buffer object with an input row. buffer = buffer + input. + * Updates the aggregation buffer object with an input row and returns a new buffer object. For + * performance, the function may do in-place update and return it instead of constructing new --- End diff -- Oh, got it. That makes sense.
[GitHub] spark issue #14627: [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check fi...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14627 @rxin, it does not fix any bug; it just gets rid of duplicated logic. In the future I will open a separate JIRA in cases like this to prevent confusion. Thank you!
[GitHub] spark issue #16371: [SPARK-18932][SQL] Support partial aggregation for colle...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16371 @hvanhovell Got it. Thanks for the review.
[GitHub] spark issue #16361: [SPARK-18952] Regex strings not properly escaped in code...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16361 It seems that the grouping key alias is only used for execution (the logical Aggregate node doesn't need grouping expressions to be named). Can we just alias them as k1, k2, ... to avoid this problem?
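A small sketch of the suggestion, using catalyst's `Alias` (the helper itself is hypothetical): give grouping expressions positional names so their string form, which may contain regex characters, never leaks into generated code.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

// Hypothetical helper: name grouping keys k1, k2, ... regardless of content.
def aliasGroupingKeys(keys: Seq[Expression]): Seq[NamedExpression] =
  keys.zipWithIndex.map { case (k, i) => Alias(k, s"k${i + 1}")() }
```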
[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16294 Merged build finished. Test PASSed.
[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70530/ Test PASSed.
[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16294 **[Test build #70530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70530/testReport)** for PR 16294 at commit [`576b432`](https://github.com/apache/spark/commit/576b432f4eb90dae4f9c3573a5b6bd665ab1d8a9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93723071 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala --- @@ -195,12 +195,25 @@ class PartitionProviderCompatibilitySuite withTempDir { dir => setupPartitionedDatasourceTable("test", dir) if (enabled) { - spark.sql("msck repair table test") + assert(spark.table("test").count() == 0) +} else { + assert(spark.table("test").count() == 5) } -assert(spark.sql("select * from test").count() == 5) -spark.range(10).selectExpr("id as fieldOne", "id as partCol") + +spark.range(3, 13).selectExpr("id as fieldOne", "id as partCol") .write.partitionBy("partCol").mode("append").saveAsTable("test") -assert(spark.sql("select * from test").count() == 15) + +if (enabled) { + // Only the newly written partitions are visible, which means the partitions --- End diff -- to be consistent with the behavior of `InsertIntoTable`. I'll add that.
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93723027 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -635,4 +638,13 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be checkAnswer(spark.table("t"), Row(1, "a") :: Row(2, "b") :: Nil) } } + + test("use saveAsTable to append to a data source table implementing CreatableRelationProvider") { +withTable("t") { + val provider = "org.apache.spark.sql.test.DefaultSource" --- End diff -- The data source is defined in this file: https://github.com/apache/spark/pull/15996/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R48 I think it's easy to tell that it extends `CreatableRelationProvider` but does not return an `InsertableRelation`.
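For context, a hedged sketch of the distinction the test relies on (simplified; not the PR's actual test source): a source can implement `CreatableRelationProvider`, so `saveAsTable` can write through it, while the relation it returns does not mix in `InsertableRelation`.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class SketchSource extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // A real source would persist `data` here. The returned relation is not
    // an InsertableRelation, which is exactly the case under test.
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = data.schema
    }
  }
}
```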
[GitHub] spark issue #16371: [SPARK-18932][SQL] Support partial aggregation for colle...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16371 sounds good.
[GitHub] spark issue #16368: [SPARK-18958][SPARKR] R API toJSON on DataFrame
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16368 LGTM. Thanks @felixcheung
[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16294 LGTM pending tests
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93722426 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala --- @@ -195,12 +195,25 @@ class PartitionProviderCompatibilitySuite withTempDir { dir => setupPartitionedDatasourceTable("test", dir) if (enabled) { - spark.sql("msck repair table test") --- End diff -- yep
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93722334

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -140,153 +140,55 @@ case class CreateDataSourceTableAsSelectCommand(
     val tableIdentWithDB = table.identifier.copy(database = Some(db))
     val tableName = tableIdentWithDB.unquotedString
-    var createMetastoreTable = false
-    // We may need to reorder the columns of the query to match the existing table.
-    var reorderedColumns = Option.empty[Seq[NamedExpression]]
     if (sessionState.catalog.tableExists(tableIdentWithDB)) {
-      // Check if we need to throw an exception or just return.
-      mode match {
-        case SaveMode.ErrorIfExists =>
-          throw new AnalysisException(s"Table $tableName already exists. " +
-            s"If you are using saveAsTable, you can set SaveMode to SaveMode.Append to " +
-            s"insert data into the table or set SaveMode to SaveMode.Overwrite to overwrite " +
-            s"the existing data. " +
-            s"Or, if you are using SQL CREATE TABLE, you need to drop $tableName first.")
-        case SaveMode.Ignore =>
-          // Since the table already exists and the save mode is Ignore, we will just return.
-          return Seq.empty[Row]
-        case SaveMode.Append =>
-          val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
-
-          if (existingTable.provider.get == DDLUtils.HIVE_PROVIDER) {
-            throw new AnalysisException(s"Saving data in the Hive serde table $tableName is " +
-              "not supported yet. Please use the insertInto() API as an alternative.")
-          }
-
-          // Check if the specified data source matches the data source of the existing table.
-          val existingProvider = DataSource.lookupDataSource(existingTable.provider.get)
-          val specifiedProvider = DataSource.lookupDataSource(table.provider.get)
-          // TODO: Check that options from the resolved relation match the relation that we are
-          // inserting into (i.e. using the same compression).
-          if (existingProvider != specifiedProvider) {
-            throw new AnalysisException(s"The format of the existing table $tableName is " +
-              s"`${existingProvider.getSimpleName}`. It doesn't match the specified format " +
-              s"`${specifiedProvider.getSimpleName}`.")
-          }
-
-          if (query.schema.length != existingTable.schema.length) {
-            throw new AnalysisException(
-              s"The column number of the existing table $tableName" +
-                s"(${existingTable.schema.catalogString}) doesn't match the data schema" +
-                s"(${query.schema.catalogString})")
-          }
-
-          val resolver = sessionState.conf.resolver
-          val tableCols = existingTable.schema.map(_.name)
-
-          reorderedColumns = Some(existingTable.schema.map { f =>
-            query.resolve(Seq(f.name), resolver).getOrElse {
-              val inputColumns = query.schema.map(_.name).mkString(", ")
-              throw new AnalysisException(
-                s"cannot resolve '${f.name}' given input columns: [$inputColumns]")
-            }
-          })
-
-          // In `AnalyzeCreateTable`, we verified the consistency between the user-specified table
-          // definition (partition columns, bucketing) and the SELECT query; here we also need to
-          // verify the consistency between the user-specified table definition and the existing
-          // table definition.
-
-          // Check if the specified partition columns match the existing table.
-          val specifiedPartCols = CatalogUtils.normalizePartCols(
-            tableName, tableCols, table.partitionColumnNames, resolver)
-          if (specifiedPartCols != existingTable.partitionColumnNames) {
-            throw new AnalysisException(
-              s"""
-                |Specified partitioning does not match that of the existing table $tableName.
-                |Specified partition columns: [${specifiedPartCols.mkString(", ")}]
-                |Existing partition columns: [${existingTable.partitionColumnNames.mkString(", ")}]
-              """.stripMargin)
-          }
-
-          // Check if the specified bucketing matches the existing table.
-          val specifiedBucketSpec = table.bucketSpec.map { bucketSpec =>
-            CatalogUtils.normalizeBucketSpec(tableName, tableCols, bucketSpec, resolver)
-          }
-          if (specifiedBucketSpec !=
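For context on the `SaveMode.Append` branch being removed here, a minimal sketch of how these checks surface through `saveAsTable`; the table and column names are hypothetical, and the error text follows the strings in the diff above:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("ctas-append-sketch").getOrCreate()
import spark.implicits._

// Create a Parquet-backed data source table with columns (id, name).
Seq((1, "a")).toDF("id", "name").write.format("parquet").saveAsTable("t")

// Appending with a matching provider succeeds even if the query's column
// order differs: the command resolves each existing column by name and
// reorders the query output to match the table schema.
Seq(("b", 2)).toDF("name", "id").write.format("parquet")
  .mode(SaveMode.Append).saveAsTable("t")

// Appending with a different provider fails the provider check with
// "The format of the existing table ... doesn't match the specified format".
Seq((3, "c")).toDF("id", "name").write.format("json")
  .mode(SaveMode.Append).saveAsTable("t")
```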
[GitHub] spark pull request #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTable...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15996#discussion_r93722277

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -363,48 +365,125 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
       throw new AnalysisException("Cannot create hive serde table with saveAsTable API")
     }
-    val tableExists = df.sparkSession.sessionState.catalog.tableExists(tableIdent)
-
-    (tableExists, mode) match {
-      case (true, SaveMode.Ignore) =>
-        // Do nothing
-
-      case (true, SaveMode.ErrorIfExists) =>
-        throw new AnalysisException(s"Table $tableIdent already exists.")
-
-      case _ =>
-        val existingTable = if (tableExists) {
-          Some(df.sparkSession.sessionState.catalog.getTableMetadata(tableIdent))
-        } else {
-          None
-        }
-        val storage = if (tableExists) {
-          existingTable.get.storage
-        } else {
-          DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
-        }
-        val tableType = if (tableExists) {
-          existingTable.get.tableType
-        } else if (storage.locationUri.isDefined) {
-          CatalogTableType.EXTERNAL
-        } else {
-          CatalogTableType.MANAGED
+    val catalog = df.sparkSession.sessionState.catalog
+    val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase)
+    val tableIdentWithDB = tableIdent.copy(database = Some(db))
+    val tableName = tableIdent.unquotedString
+
+    catalog.getTableMetadataOption(tableIdent) match {
+      // If the table already exists...
+      case Some(tableMeta) =>
+        mode match {
+          case SaveMode.Ignore => // Do nothing
+
+          case SaveMode.ErrorIfExists =>
+            throw new AnalysisException(s"Table $tableName already exists. You can set SaveMode " +
+              "to SaveMode.Append to insert data into the table or set SaveMode to " +
+              "SaveMode.Overwrite to overwrite the existing data.")
+
+          case SaveMode.Append =>
+            // Check if the specified data source matches the data source of the existing table.
+            val specifiedProvider = DataSource.lookupDataSource(source)
+            // TODO: Check that options from the resolved relation match the relation that we are
+            // inserting into (i.e. using the same compression).
+
+            // Pass a table identifier with database part, so that `lookupRelation` won't get temp
+            // views unexpectedly.
+            EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
+              case l @ LogicalRelation(_: InsertableRelation | _: HadoopFsRelation, _, _) =>
+                // check if the file formats match
+                l.relation match {
+                  case r: HadoopFsRelation if r.fileFormat.getClass != specifiedProvider =>
+                    throw new AnalysisException(
+                      s"The file format of the existing table $tableName is " +
+                        s"`${r.fileFormat.getClass.getName}`. It doesn't match the specified " +
+                        s"format `$source`")
+                  case _ =>
+                }
+              case s: SimpleCatalogRelation if DDLUtils.isDatasourceTable(s.metadata) => // OK.
+              case c: CatalogRelation if c.catalogTable.provider == Some(DDLUtils.HIVE_PROVIDER) =>
+                throw new AnalysisException(s"Saving data in the Hive serde table $tableName " +
+                  s"is not supported yet. Please use the insertInto() API as an alternative.")
+              case o =>
+                throw new AnalysisException(s"Saving data in ${o.toString} is not supported.")
+            }
+
+            val existingSchema = tableMeta.schema
+            if (df.logicalPlan.schema.size != existingSchema.size) {
+              throw new AnalysisException(
+                s"The column number of the existing table $tableName" +
+                  s"(${existingSchema.catalogString}) doesn't match the data schema" +
+                  s"(${df.logicalPlan.schema.catalogString})")
+            }
+
+            if (partitioningColumns.isDefined) {
+              logWarning("append to an existing table, the specified partition columns " +
+                s"[${partitioningColumns.get.mkString(", ")}] will be ignored.")
+            }
+
+            val specifiedBucketSpec = getBucketSpec
+            if (specifiedBucketSpec.isDefined) {
+              logWarning("append to an existing table, the specified bucketing " +
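The comment about passing an identifier with a database part is easy to miss; here is a hedged sketch of the situation it guards against (all names are hypothetical): a temporary view that shares a name with a catalog table must not become the target of the append.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("lookup-sketch").getOrCreate()
import spark.implicits._

// A catalog table and a temp view that happen to share the name "events".
Seq((1L, "click")).toDF("id", "kind").write.saveAsTable("events")
Seq((99L, "bogus")).toDF("id", "kind").createOrReplaceTempView("events")

// Because the new code resolves the qualified name (e.g. `default.events`)
// via lookupRelation, the append is validated against the catalog table's
// relation rather than the unrelated temp view.
Seq((2L, "view")).toDF("id", "kind")
  .write.mode(SaveMode.Append).saveAsTable("events")
```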
[GitHub] spark issue #16370: [SPARK-18960][SQL][SS] Avoid double reading file which i...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16370 @zsxwing Thanks for the reminder! In some cases we can avoid this issue on the user side, for example by not using `-cp`. But since that is user-side behaviour, we cannot ensure that every user knows and uses the correct way to move data, and the issue is confusing for those who hit it unknowingly. Besides, the current change is so tiny that it does no harm to the code. So, IMHO, it is OK to add the check as a protection. What do you think?
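For readers unfamiliar with the failure mode being discussed: `hadoop fs -cp` writes the destination through a temporary `._COPYING_` file and renames it when the copy finishes, so a streaming file source that lists the directory mid-copy can see a partial file and then the finished one. Below is a minimal sketch of the kind of guard proposed, not the PR's actual diff:

```scala
import org.apache.hadoop.fs.Path

// Suffix used by `hadoop fs -cp` / `-put` for an in-progress destination
// file; the file is renamed to its final name once the copy completes.
val copyingSuffix = "._COPYING_"

/** True if the path looks like a completed file that is safe to ingest. */
def isCompleteFile(path: Path): Boolean =
  !path.getName.endsWith(copyingSuffix)
```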
[GitHub] spark issue #16294: [SPARK-18669][SS][DOCS] Update Apache docs for Structure...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16294 **[Test build #70530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70530/testReport)** for PR 16294 at commit [`576b432`](https://github.com/apache/spark/commit/576b432f4eb90dae4f9c3573a5b6bd665ab1d8a9).