[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-17 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r203148694
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

Here is the PR: https://github.com/apache/spark/pull/21798 Please, have a 
look at it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21769


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202831899
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

Can we submit a separate PR to add a new option for AVRO? We should not 
rely on hadoopConf to control the behaviors of AVRO.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-15 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202541358
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -680,12 +689,22 @@ class AvroSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
 
   Files.createFile(new File(tempSaveDir, "non-avro").toPath)
 
-  val newDf = spark
-.read
-.option(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
-.avro(tempSaveDir)
+  val count = try {
--- End diff --

Nit: consider writing the `try...finally` like this:
```
  val hadoopConf = spark.sqlContext.sparkContext.hadoopConfiguration
  try {
hadoopConf.set(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, 
"true")
val count = spark.read.avro(tempSaveDir).count()
assert(count == 8)
  } finally {
hadoopConf.unset(AvroFileFormat.IgnoreFilesWithoutExtensionProperty)
  }
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202526423
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

The Hadoop config can be changed like:
```scala
spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202525034
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite {
   spark.read.avro("*/*/*/*/*/*/*/something.avro")
 }
 
-intercept[FileNotFoundException] {
+intercept[java.io.IOException] {
   TestUtils.withTempDir { dir =>
 FileUtils.touch(new File(dir, "test"))
 spark.read.avro(dir.toString)
--- End diff --

I would actually remove this piece of code from the test, and write a 
separate test (reading a folder with files without `.avro` extensions) in which 
the Hadoop's parameter `avro.mapred.ignore.inputs.without.extension` is set to 
`true` explicitly. Unfortunatelly there is no method like `withSQLConf` for 
Hadoop's configs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202524884
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

Here is how people use the option so far: 
https://github.com/databricks/spark-avro/issues/71#issuecomment-280844562 . 
Probably we should seperatly from the PR discuss how we are going to fix the 
"bug" and break backward compatibily.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202524696
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite {
   assert(readDf.collect().sameElements(writeDf.collect()))
 }
   }
+
+  test("SPARK-24805: reading files without .avro extension") {
--- End diff --

Does it just introduce unnesseccary dependency here and overcomplicate the 
test? I can create small (with just one row) avro file without `.avro` 
extension especially for the test. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202524552
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

The `avro.mapred.ignore.inputs.without.extension` is hadoop's parameter. 
This PR aims to change the default behavior only. I don't want to convert the 
hadoop parameter to Avro datasource option here. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202521397
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
@@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat 
with DataSourceRegister {
 // Schema evolution is not supported yet. Here we only pick a single 
random sample file to
 // figure out the schema of the whole dataset.
 val sampleFile =
-  if 
(conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
+  if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
--- End diff --

I tried running queries. The option 
`avro.mapred.ignore.inputs.without.extension` is not set in `conf`.  This is a 
bug in `spark-avro`.
Please read the value from `options`. It would be good to have a new test 
case with `avro.mapred.ignore.inputs.without.extension` as true.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202521471
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite {
   spark.read.avro("*/*/*/*/*/*/*/something.avro")
 }
 
-intercept[FileNotFoundException] {
+intercept[java.io.IOException] {
   TestUtils.withTempDir { dir =>
 FileUtils.touch(new File(dir, "test"))
 spark.read.avro(dir.toString)
--- End diff --

We can fix the case as
```
spark.read.option("avro.mapred.ignore.inputs.without.extension", 
false).avro(dir.toString)
```
The behavior will be the same as before. And we don't need to modify the 
expected `FileNotFoundException`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21769#discussion_r202521435
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite {
   assert(readDf.collect().sameElements(writeDf.collect()))
 }
   }
+
+  test("SPARK-24805: reading files without .avro extension") {
--- End diff --

Nit: Can we create a temp path and copy the original `episodes.avro` to the 
path? So that we don't need to have two duplicated resource file. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...

2018-07-14 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21769

[SPARK-24805][SQL] Do not ignore avro files without extensions

## What changes were proposed in this pull request?

In the PR, I propose to change default behaviour of AVRO datasource which 
currently ignores files without `.avro` extension in read by default. This PR 
sets the default value for `avro.mapred.ignore.inputs.without.extension` to 
`false` in the case if the parameter is not set by an user.

## How was this patch tested?

Added a test file without extension in AVRO format, and new test for 
reading the file with and wihout specified schema.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 avro-without-extension

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21769.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21769


commit 35063ef8e734bdeb39316e02b2e8451d0d75d43a
Author: Maxim Gekk 
Date:   2018-07-14T09:49:30Z

Test for reading files without avro extension

commit 760f98e7aecb5a4c267599b318479d7f2ade165a
Author: Maxim Gekk 
Date:   2018-07-14T10:38:33Z

Fix tests

commit 8562a8d43868d551efa6a0e9a00d50cdc838a178
Author: Maxim Gekk 
Date:   2018-07-14T10:53:24Z

Adding ticket number to test title




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org