[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14151 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155706338 --- Diff: python/pyspark/sql/readwriter.py --- @@ -313,11 +313,16 @@ def text(self, paths): Each line in the text file is a new row in the resulting DataFrame. :param paths: string, or list of strings, for input path(s). +:param wholetext: if true, read each file from input path(s) as a single row. >>> df = spark.read.text('python/test_support/sql/text-test.txt') >>> df.collect() [Row(value=u'hello'), Row(value=u'this')] +>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True) +>>> df.collect() +[Row(value=u'hello\nthis')] --- End diff -- That would fail the test, I suppose. I can give that a try though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155527201 --- Diff: python/pyspark/sql/readwriter.py --- @@ -313,11 +313,16 @@ def text(self, paths): Each line in the text file is a new row in the resulting DataFrame. :param paths: string, or list of strings, for input path(s). +:param wholetext: if true, read each file from input path(s) as a single row. >>> df = spark.read.text('python/test_support/sql/text-test.txt') >>> df.collect() [Row(value=u'hello'), Row(value=u'this')] +>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True) +>>> df.collect() +[Row(value=u'hello\nthis')] --- End diff -- Hm, can't we just do `\\n`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155143739 --- Diff: python/pyspark/sql/readwriter.py --- @@ -313,11 +313,13 @@ def text(self, paths): Each line in the text file is a new row in the resulting DataFrame. :param paths: string, or list of strings, for input path(s). +:param wholetext: if true, read each file from input path(s) as a single row. >>> df = spark.read.text('python/test_support/sql/text-test.txt') >>> df.collect() [Row(value=u'hello'), Row(value=u'this')] """ --- End diff -- Can you add a doctest for `wholetext` too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r15514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -97,14 +109,26 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { assert( requiredSchema.length <= 1, "Text data source only produces a single data column named \"value\".") - +val textOptions = new TextOptions(options) val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +readToUnsafeMem(broadcastedHadoopConf, requiredSchema, textOptions.wholeText) + } + + private def readToUnsafeMem(conf: Broadcast[SerializableConfiguration], + requiredSchema: StructType, wholeTextMode: Boolean): + (PartitionedFile) => Iterator[UnsafeRow] = { + (file: PartitionedFile) => { - val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value) + val confValue = conf.value.value + var reader: Iterator[Text] with Closeable = null + if (!wholeTextMode) { +reader = new HadoopFileLinesReader(file, confValue) + } else { +reader = new HadoopFileWholeTextReader(file, confValue) + } --- End diff -- We can avoid using `var`: ```scala val reader = if (!wholeTextMode) { new HadoopFileLinesReader(file, confValue) } else { new HadoopFileWholeTextReader(file, confValue) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155139949 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -39,6 +39,54 @@ class TextSuite extends QueryTest with SharedSQLContext { verifyFrame(spark.read.text(testFile)) } + test("reading text file with option wholetext=true") { +val df = spark.read.option("wholetext", "true") + .format("text").load(testFile) +// schema +assert(df.schema == new StructType().add("value", StringType)) + +// verify content +val data = df.collect() +assert(data(0) == + Row( +// scalastyle:off nonascii +"""This is a test file for the text data source + |1+1 + |æ°æ®ç 头 + |"doh" + |""".stripMargin)) +// scalastyle:on --- End diff -- nit: // scalastyle:on nonascii --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137601363 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean + +readToUnsafeMem(broadcastedHadoopConf, requiredSchema, + wholeText) + } + + private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration], + requiredSchema: StructType, wholeTextMode: Boolean): --- End diff -- ```Scala private def readToUnsafeMem( conf: Broadcast[SerializableConfiguration], requiredSchema: StructType, wholeTextMode: Boolean): (PartitionedFile) => Iterator[UnsafeRow] ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137601163 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean + +readToUnsafeMem(broadcastedHadoopConf, requiredSchema, + wholeText) + } + + private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration], + requiredSchema: StructType, wholeTextMode: Boolean): --- End diff -- `private[datasources]` => `private` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137601004 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean --- End diff -- Move it to `TextOptions`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137600781 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -53,6 +57,12 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { } } + override def isSplitable(sparkSession: SparkSession, + options: Map[String, String], path: Path): Boolean = { --- End diff -- ``` override def isSplitable( sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean = { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137595908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean + +readToUnsafeMem(broadcastedHadoopConf, requiredSchema, + wholeText) --- End diff -- It fits one line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r137595716 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -185,8 +233,7 @@ class TextSuite extends QueryTest with SharedSQLContext { val data = df.collect() assert(data(0) == Row("This is a test file for the text data source")) assert(data(1) == Row("1+1")) -// non ascii characters are not allowed in the code, so we disable the scalastyle here. -// scalastyle:off +// scalastyle:off nonascii assert(data(2) == Row("æ°æ®ç 头")) // scalastyle:on --- End diff -- `// scalastyle:on nonascii` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r131100257 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.Closeable +import java.net.URI + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce._ +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl + +import org.apache.spark.input.WholeTextFileRecordReader + +/** + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines + * in that file. + */ +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) --- End diff -- Thank you, for catching this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130859352 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -185,11 +231,11 @@ class TextSuite extends QueryTest with SharedSQLContext { val data = df.collect() assert(data(0) == Row("This is a test file for the text data source")) assert(data(1) == Row("1+1")) -// non ascii characters are not allowed in the code, so we disable the scalastyle here. -// scalastyle:off +// scalastyle:off nonascii assert(data(2) == Row("æ°æ®ç 头")) // scalastyle:on assert(data(3) == Row("\"doh\"")) assert(data.length == 4) } + --- End diff -- nit: remove this empty line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130859483 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -39,6 +39,52 @@ class TextSuite extends QueryTest with SharedSQLContext { verifyFrame(spark.read.text(testFile)) } + test("reading text file with option wholetext=true") { +val df = spark.read.option("wholetext", "true") + .format("text").load(testFile) +// schema +assert(df.schema == new StructType().add("value", StringType)) + +// verify content +val data = df.collect() +assert(data(0) == + Row( +// scalastyle:off nonascii +"""This is a test file for the text data source + |1+1 + |æ°æ®ç 头 + |"doh" + |""".stripMargin)) +// scalastyle:on +assert(data.length == 1) + } + + test("reading multiple text files with option wholetext=true") { +import org.apache.spark.sql.catalyst.util._ +withTempDir { dir => + val file1 = new File(dir, "text1.txt") + stringToFile(file1, +"""text file 1 contents. + |From: None to: ?? +""".stripMargin) + val file2 = new File(dir, "text2.txt") + stringToFile(file2, "text file 2 contents.") + val file3 = new File(dir, "text3.txt") + stringToFile(file3, "text file 3 contents.") + val df = spark.read.option("wholetext", "true").text(dir.getAbsolutePath) + // Since wholetext option reads each file into a single row, df.length should be no. of files. + val data = df.sort("value").collect() + assert(data.length == 3) + // Each files should represent a single Row/element in Dataframe/Dataset + assert(data(0) == Row( +"""text file 1 contents. + |From: None to: ?? +""".stripMargin)) + assert(data(1) == Row( +"""text file 2 contents.""".stripMargin)) --- End diff -- nit: should we also check for `data(2)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130859601 --- Diff: python/pyspark/sql/readwriter.py --- @@ -309,11 +309,13 @@ def text(self, paths): Each line in the text file is a new row in the resulting DataFrame. :param paths: string, or list of strings, for input path(s). +:param wholetext: if true, read each file from input paths as a single row. --- End diff -- nit: `paths` -> `path(s)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130858731 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -101,10 +105,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean + +readToUnsafeMem(broadcastedHadoopConf, requiredSchema, + wholeText) + } + + private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration], + requiredSchema: StructType, wholeTextMode: Boolean): + (PartitionedFile) => Iterator[UnsafeRow] = { (file: PartitionedFile) => { - val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value) - Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => reader.close())) + var reader: Iterator[Text] with Closeable = null + if (!wholeTextMode) { +reader = new HadoopFileLinesReader(file, conf.value.value) --- End diff -- nit: We can extract the Configuration beforehand. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130267182 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.Closeable +import java.net.URI + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce._ +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl + +import org.apache.spark.input.WholeTextFileRecordReader + +/** + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines + * in that file. + */ +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) --- End diff -- We may want to override `isSplitable` of `TextFileFormat` and return false when `wholetext` option is enabled. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r130265982 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.Closeable +import java.net.URI + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce._ +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl + +import org.apache.spark.input.WholeTextFileRecordReader + +/** + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines + * in that file. + */ +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) --- End diff -- I'd like to remind that a `PartitionedFile` can be just a part of a input file, instead of a whole file. So you cannot guarantee that in this case the reader reads all content of a file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r81292587 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean --- End diff -- Good reminder ! @HyukjinKwon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r81289022 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean --- End diff -- Actually, we need to document this within `readwriter.py` too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r77118505 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister { val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean --- End diff -- Like what we did for csv and json, could you document this new option in `DataFrameReader`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r74903349 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -533,6 +533,12 @@ object SQLConf { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(10L) + val WHOLETEXT = +SQLConfigBuilder("spark.sql.wholetext") --- End diff -- They are removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r74902549 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext { verifyFrame(spark.read.text(testFile)) } + test("reading text file with wholetext option on") { --- End diff -- Thanks Fred, Good point !. I have added a test below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r74831276 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -533,6 +533,12 @@ object SQLConf { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(10L) + val WHOLETEXT = +SQLConfigBuilder("spark.sql.wholetext") --- End diff -- Yea I don't think it should be a session wide config. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r74805700 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala --- @@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext { verifyFrame(spark.read.text(testFile)) } + test("reading text file with wholetext option on") { --- End diff -- As far as I'm aware, the most common use case for reading entire files is using a glob to read a directory or directory tree containing multiple files. For example, one might download the Enron corpus (see [https://www.cs.cmu.edu/~./enron/]), which comes packaged with one file per email message. With a large number of files on the input, it's important that the work of processing the files be split among many cores. So the test for the `wholetext` option really should have multiple input files and verify that different files end up in different partitions of the resulting RDD or Dataframe. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r74804217 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -533,6 +533,12 @@ object SQLConf { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(10L) + val WHOLETEXT = +SQLConfigBuilder("spark.sql.wholetext") --- End diff -- Should this really be a session-global configuration? It seems like something that is specific to a particular input file and should only be set when opening a given file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r70564698 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.net.URI + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce._ +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl + +import org.apache.spark.input.WholeTextFileRecordReader + +/** + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines + * in that file. + */ +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] { --- End diff -- Looks like they might get used in multiple other formats too, what do you intend by proper package it is unclear to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r70564514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.net.URI + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce._ +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl + +import org.apache.spark.input.WholeTextFileRecordReader + +/** + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines + * in that file. + */ +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] { --- End diff -- This is currently in the same package as `HadoopFileLineReader` ? i.e. `datasources`. Should I move both of them to the package `datasource.text` ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org