[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14151





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-07 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r155706338
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,16 @@ def text(self, paths):
 Each line in the text file is a new row in the resulting DataFrame.
 
 :param paths: string, or list of strings, for input path(s).
+:param wholetext: if true, read each file from input path(s) as a single row.
 
 >>> df = spark.read.text('python/test_support/sql/text-test.txt')
 >>> df.collect()
 [Row(value=u'hello'), Row(value=u'this')]
+>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
+>>> df.collect()
+[Row(value=u'hello\nthis')]
--- End diff --

That would fail the test, I suppose. I can give that a try though.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r155527201
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,16 @@ def text(self, paths):
 Each line in the text file is a new row in the resulting DataFrame.
 
 :param paths: string, or list of strings, for input path(s).
+:param wholetext: if true, read each file from input path(s) as a single row.
 
 >>> df = spark.read.text('python/test_support/sql/text-test.txt')
 >>> df.collect()
 [Row(value=u'hello'), Row(value=u'this')]
+>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
+>>> df.collect()
+[Row(value=u'hello\nthis')]
--- End diff --

Hm, can't we just do `\\n`?





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r155143739
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,13 @@ def text(self, paths):
 Each line in the text file is a new row in the resulting DataFrame.
 
 :param paths: string, or list of strings, for input path(s).
+:param wholetext: if true, read each file from input path(s) as a single row.
 
 >>> df = spark.read.text('python/test_support/sql/text-test.txt')
 >>> df.collect()
 [Row(value=u'hello'), Row(value=u'this')]
 """
--- End diff --

Can you add a doctest for `wholetext` too?





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r15514
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -97,14 +109,26 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     assert(
       requiredSchema.length <= 1,
       "Text data source only produces a single data column named \"value\".")
-
+    val textOptions = new TextOptions(options)
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema, textOptions.wholeText)
+  }
+
+  private def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
+      requiredSchema: StructType, wholeTextMode: Boolean):
+    (PartitionedFile) => Iterator[UnsafeRow] = {
+
     (file: PartitionedFile) => {
-      val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value)
+      val confValue = conf.value.value
+      var reader: Iterator[Text] with Closeable = null
+      if (!wholeTextMode) {
+        reader = new HadoopFileLinesReader(file, confValue)
+      } else {
+        reader = new HadoopFileWholeTextReader(file, confValue)
+      }
--- End diff --

We can avoid using `var`:
```scala
val reader = if (!wholeTextMode) {
  new HadoopFileLinesReader(file, confValue)
} else {
  new HadoopFileWholeTextReader(file, confValue)
}
```





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r155139949
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -39,6 +39,54 @@ class TextSuite extends QueryTest with SharedSQLContext {
     verifyFrame(spark.read.text(testFile))
   }
 
+  test("reading text file with option wholetext=true") {
+    val df = spark.read.option("wholetext", "true")
+      .format("text").load(testFile)
+    // schema
+    assert(df.schema == new StructType().add("value", StringType))
+
+    // verify content
+    val data = df.collect()
+    assert(data(0) ==
+      Row(
+        // scalastyle:off nonascii
+        """This is a test file for the text data source
+          |1+1
+          |数据砖头
+          |"doh"
+          |""".stripMargin))
+    // scalastyle:on
--- End diff --

nit: // scalastyle:on nonascii





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137601363
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
+
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
+      wholeText)
+  }
+
+  private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
+      requiredSchema: StructType, wholeTextMode: Boolean):
--- End diff --

```Scala
  private def readToUnsafeMem(
  conf: Broadcast[SerializableConfiguration],
  requiredSchema: StructType,
  wholeTextMode: Boolean): (PartitionedFile) => Iterator[UnsafeRow]
```





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137601163
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
+
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
+      wholeText)
+  }
+
+  private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
+      requiredSchema: StructType, wholeTextMode: Boolean):
--- End diff --

`private[datasources]` => `private`





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137601004
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
--- End diff --

Move it to `TextOptions`?
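For reference, a minimal sketch of what moving the flag into `TextOptions` could look like (the real class also handles the compression option and wraps the map case-insensitively; only the `wholetext` key and the `wholeText` name are taken from this PR, the rest is assumed):

```scala
package org.apache.spark.sql.execution.datasources.text

/**
 * Options for the text data source. Sketch only: the flag is parsed once here
 * instead of via options.getOrElse(...) inside buildReader.
 */
private[text] class TextOptions(@transient private val parameters: Map[String, String])
  extends Serializable {

  /** If true, each input file is read as a single record rather than split into lines. */
  val wholeText: Boolean = parameters.getOrElse("wholetext", "false").toBoolean
}
```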





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137600781
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -53,6 +57,12 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     }
   }
 
+  override def isSplitable(sparkSession: SparkSession,
+      options: Map[String, String], path: Path): Boolean = {
--- End diff --

```
  override def isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean = {
```





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137595908
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
+
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
+      wholeText)
--- End diff --

It fits on one line.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-09-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r137595716
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -185,8 +233,7 @@ class TextSuite extends QueryTest with SharedSQLContext {
 val data = df.collect()
 assert(data(0) == Row("This is a test file for the text data source"))
 assert(data(1) == Row("1+1"))
-// non ascii characters are not allowed in the code, so we disable the scalastyle here.
-// scalastyle:off
+// scalastyle:off nonascii
 assert(data(2) == Row("数据砖头"))
 // scalastyle:on
--- End diff --

`// scalastyle:on nonascii`





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-08-03 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r131100257
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
 ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.Closeable
+import java.net.URI
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+
+import org.apache.spark.input.WholeTextFileRecordReader
+
+/**
+ * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
+ * in that file.
+ */
+class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
--- End diff --

Thank you for catching this.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-08-02 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130859352
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -185,11 +231,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
 val data = df.collect()
 assert(data(0) == Row("This is a test file for the text data source"))
 assert(data(1) == Row("1+1"))
-// non ascii characters are not allowed in the code, so we disable the scalastyle here.
-// scalastyle:off
+// scalastyle:off nonascii
 assert(data(2) == Row("数据砖头"))
 // scalastyle:on
 assert(data(3) == Row("\"doh\""))
 assert(data.length == 4)
   }
+
--- End diff --

nit: remove this empty line.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-08-02 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130859483
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -39,6 +39,52 @@ class TextSuite extends QueryTest with SharedSQLContext {
     verifyFrame(spark.read.text(testFile))
   }
 
+  test("reading text file with option wholetext=true") {
+    val df = spark.read.option("wholetext", "true")
+      .format("text").load(testFile)
+    // schema
+    assert(df.schema == new StructType().add("value", StringType))
+
+    // verify content
+    val data = df.collect()
+    assert(data(0) ==
+      Row(
+        // scalastyle:off nonascii
+        """This is a test file for the text data source
+          |1+1
+          |数据砖头
+          |"doh"
+          |""".stripMargin))
+    // scalastyle:on
+    assert(data.length == 1)
+  }
+
+  test("reading multiple text files with option wholetext=true") {
+    import org.apache.spark.sql.catalyst.util._
+    withTempDir { dir =>
+      val file1 = new File(dir, "text1.txt")
+      stringToFile(file1,
+        """text file 1 contents.
+          |From: None to: ??
+        """.stripMargin)
+      val file2 = new File(dir, "text2.txt")
+      stringToFile(file2, "text file 2 contents.")
+      val file3 = new File(dir, "text3.txt")
+      stringToFile(file3, "text file 3 contents.")
+      val df = spark.read.option("wholetext", "true").text(dir.getAbsolutePath)
+      // Since wholetext option reads each file into a single row, df.length should be no. of files.
+      val data = df.sort("value").collect()
+      assert(data.length == 3)
+      // Each files should represent a single Row/element in Dataframe/Dataset
+      assert(data(0) == Row(
+        """text file 1 contents.
+          |From: None to: ??
+        """.stripMargin))
+      assert(data(1) == Row(
+        """text file 2 contents.""".stripMargin))
--- End diff --

nit: should we also check for `data(2)`?





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-08-02 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130859601
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -309,11 +309,13 @@ def text(self, paths):
 Each line in the text file is a new row in the resulting DataFrame.
 
 :param paths: string, or list of strings, for input path(s).
+:param wholetext: if true, read each file from input paths as a single row.
--- End diff --

nit: `paths` -> `path(s)`





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-08-02 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130858731
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -101,10 +105,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
+
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
+      wholeText)
+  }
+
+  private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
+      requiredSchema: StructType, wholeTextMode: Boolean):
+    (PartitionedFile) => Iterator[UnsafeRow] = {
     (file: PartitionedFile) => {
-      val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value)
-      Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => reader.close()))
 
+      var reader: Iterator[Text] with Closeable = null
+      if (!wholeTextMode) {
+        reader = new HadoopFileLinesReader(file, conf.value.value)
--- End diff --

nit: We can extract the Configuration beforehand.
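Concretely, the suggestion is to dereference the broadcast once before the branch and reuse the local value in both arms. A sketch of that shape (these are Spark-internal classes, so this only compiles inside TextFileFormat.scala; the helper name `chooseReader` is made up):

```scala
// Sketch: extract the Hadoop Configuration from the broadcast once, up front,
// instead of calling conf.value.value separately in each branch.
private def chooseReader(
    conf: Broadcast[SerializableConfiguration],
    wholeTextMode: Boolean): PartitionedFile => Iterator[Text] with Closeable = {
  (file: PartitionedFile) => {
    val confValue = conf.value.value  // extracted beforehand
    if (wholeTextMode) {
      new HadoopFileWholeTextReader(file, confValue)
    } else {
      new HadoopFileLinesReader(file, confValue)
    }
  }
}
```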





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-07-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130267182
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
 ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.Closeable
+import java.net.URI
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+
+import org.apache.spark.input.WholeTextFileRecordReader
+
+/**
+ * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
+ * in that file.
+ */
+class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
--- End diff --

We may want to override `isSplitable` of `TextFileFormat` and return false 
when `wholetext` option is enabled.
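A sketch of what that override could look like inside `TextFileFormat` (the direct `options.getOrElse` lookup is for brevity; the later revisions in this thread parse the option through `TextOptions` instead):

```scala
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  val wholeText = options.getOrElse("wholetext", "false").toBoolean
  // A file that must come back as a single row cannot be split across partitions.
  !wholeText && super.isSplitable(sparkSession, options, path)
}
```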





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2017-07-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r130265982
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
 ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.Closeable
+import java.net.URI
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+
+import org.apache.spark.input.WholeTextFileRecordReader
+
+/**
+ * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
+ * in that file.
+ */
+class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
--- End diff --

I'd like to point out that a `PartitionedFile` can be just a part of an input file, rather than a whole file. So you cannot guarantee that the reader reads all of a file's content in this case.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-09-30 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r81292587
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
--- End diff --

Good reminder, @HyukjinKwon!





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-09-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r81289022
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
--- End diff --

Actually, we need to document this within `readwriter.py` too.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-31 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r77118505
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
 ---
@@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
 
+    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
--- End diff --

Like what we did for csv and json, could you document this new option in 
`DataFrameReader`? 
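Roughly what that documentation could look like on `DataFrameReader.text`, mirroring how the csv/json option lists are written (the wording below is illustrative, not the committed text):

```scala
/**
 * Loads text files and returns a DataFrame whose schema starts with a string column
 * named "value", followed by partitioned columns if there are any.
 *
 * You can set the following text-specific option(s) for reading text files:
 * <ul>
 * <li>`wholetext` (default `false`): if true, read each file from input path(s) as a
 * single row instead of one row per line.</li>
 * </ul>
 */
def text(paths: String*): DataFrame
```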





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-16 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r74903349
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -533,6 +533,12 @@ object SQLConf {
   .timeConf(TimeUnit.MILLISECONDS)
   .createWithDefault(10L)
 
+  val WHOLETEXT =
+    SQLConfigBuilder("spark.sql.wholetext")
--- End diff --

They are removed.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-16 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r74902549
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
 verifyFrame(spark.read.text(testFile))
   }
 
+  test("reading text file with wholetext option on") {
--- End diff --

Thanks Fred, good point! I have added a test below.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r74831276
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -533,6 +533,12 @@ object SQLConf {
   .timeConf(TimeUnit.MILLISECONDS)
   .createWithDefault(10L)
 
+  val WHOLETEXT =
+    SQLConfigBuilder("spark.sql.wholetext")
--- End diff --

Yea, I don't think it should be a session-wide config.





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-15 Thread frreiss
Github user frreiss commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r74805700
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
 verifyFrame(spark.read.text(testFile))
   }
 
+  test("reading text file with wholetext option on") {
--- End diff --

As far as I'm aware, the most common use case for reading entire files is 
using a glob to read a directory or directory tree containing multiple files. 
For example, one might download the Enron corpus (see 
[https://www.cs.cmu.edu/~./enron/]), which comes packaged with one file per 
email message. With a large number of files on the input, it's important that 
the work of processing the files be split among many cores. So the test for the 
`wholetext` option really should have multiple input files and verify that 
different files end up in different partitions of the resulting RDD or 
Dataframe.
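A rough sketch of such a test, reusing the quoted suite's helpers (`withTempDir` from the shared test utilities, `stringToFile` from `org.apache.spark.sql.catalyst.util`, `java.io.File`); the file names and contents here are made up:

```scala
test("wholetext with multiple input files") {
  withTempDir { dir =>
    (1 to 3).foreach { i =>
      stringToFile(new File(dir, s"part-$i.txt"),
        s"first line of file $i\nsecond line of file $i")
    }
    val df = spark.read.option("wholetext", "true").text(dir.getAbsolutePath)
    // One row per file, each row holding that file's full contents; with many files
    // the reads can still be spread across cores even though a single file is no
    // longer splittable.
    val values = df.collect().map(_.getString(0)).sorted
    assert(values.length == 3)
    assert(values(0) == "first line of file 1\nsecond line of file 1")
  }
}
```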





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-08-15 Thread frreiss
Github user frreiss commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r74804217
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -533,6 +533,12 @@ object SQLConf {
   .timeConf(TimeUnit.MILLISECONDS)
   .createWithDefault(10L)
 
+  val WHOLETEXT =
+    SQLConfigBuilder("spark.sql.wholetext")
--- End diff --

Should this really be a session-global configuration? It seems like 
something that is specific to a particular input file and should only be set 
when opening a given file.
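For comparison, here is what the behaviour looks like as a per-read option on `DataFrameReader` rather than a session-wide conf: the option only affects the read it is attached to (the paths below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object WholeTextOptionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("wholetext-example").getOrCreate()

    // Default behaviour: one row per line.
    val lines = spark.read.text("/data/enron/maildir")

    // Scoped to this read only; other reads in the same session are unaffected.
    val wholeFiles = spark.read.option("wholetext", "true").text("/data/enron/maildir")

    println(s"line rows: ${lines.count()}, file rows: ${wholeFiles.count()}")
    spark.stop()
  }
}
```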





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-07-12 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r70564698
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.net.URI
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+
+import org.apache.spark.input.WholeTextFileRecordReader
+
+/**
+ * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
+ * in that file.
+ */
+class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] {
--- End diff --

Looks like they might get used by multiple other formats too. It is unclear to me what you intend by "proper package".





[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

2016-07-12 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/14151#discussion_r70564514
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.net.URI
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+
+import org.apache.spark.input.WholeTextFileRecordReader
+
+/**
+ * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
+ * in that file.
+ */
+class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] {
--- End diff --

This is currently in the same package as `HadoopFileLinesReader`, i.e. `datasources`. Should I move both of them to the package `datasources.text`?


