[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22654


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-12 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r224799147
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1826,4 +1826,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     val df = spark.read.option("enforceSchema", false).csv(input)
     checkAnswer(df, Row("1", "2"))
   }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()
--- End diff --

I prohibited the single backslash and throw an exception with a tip to use a double backslash.
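The resulting behavior can be sketched as follows; this is an illustrative standalone model of the check (the function name `delimiterToChar` is hypothetical), not the exact Spark code:

```scala
// Sketch: a lone backslash delimiter is rejected with a hint to escape it;
// a double backslash maps to the literal backslash character.
def delimiterToChar(str: String): Char = str.toSeq match {
  case Seq('\\') =>
    throw new IllegalArgumentException(
      "Single backslash is prohibited. Pass two backslashes to get the backslash character.")
  case Seq('\\', '\\') => '\\'
  case Seq(c) => c
  case _ =>
    throw new IllegalArgumentException(s"Unsupported delimiter: $str")
}

println(delimiterToChar("""\\"""))  // the single character '\'
```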


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r224626882
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1826,4 +1826,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     val df = spark.read.option("enforceSchema", false).csv(input)
     checkAnswer(df, Row("1", "2"))
   }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()
--- End diff --

If a user specifies `\` as a delimiter, we should issue an exception message and let users add the escape symbol, i.e., `\\`. It is weird to see both `\\` and `\` representing the same thing, `\`. We should be consistent in handling the backslash in all cases.


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223754099
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1826,4 +1826,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     val df = spark.read.option("enforceSchema", false).csv(input)
     checkAnswer(df, Row("1", "2"))
   }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()
--- End diff --

> Or is the point that delimiting takes precedence?

Right, if a user specifies `\` as a delimiter, the CSV parser considers it the delimiter first of all. We can see in the main loop that delimiters are handled before escape characters:

https://github.com/uniVocity/univocity-parsers/blob/6746adc2ddb420ebba7441339887e4bbc35cf087/src/main/java/com/univocity/parsers/csv/CsvParser.java#L115
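The precedence can be modeled with a toy splitter; this is a simplified sketch of the loop structure described above, not the actual univocity-parsers code:

```scala
// Toy model: in the loop, the delimiter test comes before the escape test,
// so a character that is both the delimiter and the escape acts as a delimiter.
def split(line: String, delim: Char, escape: Char): Seq[String] = {
  val fields = scala.collection.mutable.ArrayBuffer("")
  var escaped = false
  for (ch <- line) {
    if (ch == delim && !escaped) fields += ""          // delimiter checked first
    else if (ch == escape && !escaped) escaped = true  // escape checked second
    else { fields(fields.length - 1) += ch; escaped = false }
  }
  fields.toSeq
}

// With '\' as both delimiter and escape, "abc\1" splits into two fields.
println(split("""abc\1""", '\\', '\\'))
```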


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223751904
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,22 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
+      case Seq(c) => c
--- End diff --

Ah yeah, good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of `Seq('\\', c) => '\c'`, but I don't see a way that actually works, yeah.


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223749951
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,22 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
+      case Seq(c) => c
--- End diff --

I would prefer a more declarative way and fewer nested levels of control, but this is a personal opinion. Let's look at your example:
```
str.length
```
You didn't check that `str` can be null.

```
case 2 if str(0) == '\\' =>
case 'u' if str == """\u""" => '\0'
```
If it has length 2, how could `str` be `"""\u"""`?

```
case c if """trbf"'\""".contains(c) => c
```
You should produce control chars, not just the second char. For example: `"\t"` -> `Seq('\\', 't')` -> `'\t'`.

In my approach, everything is simple. One input case is mapped to one output. There is no unnecessary complexity.
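For reference, the full declarative mapping reads roughly as follows. This is a self-contained sketch reconstructed from the quoted diff and the review discussion (including the single-backslash rejection and the `\\` case); the exact merged Spark code may differ:

```scala
// One input pattern per output value; reconstructed sketch of CSVUtils.toChar.
@throws[IllegalArgumentException]
def toChar(str: String): Char = (str: Seq[Char]) match {
  case Seq() =>
    throw new IllegalArgumentException("Delimiter cannot be empty string")
  case Seq('\\') =>
    throw new IllegalArgumentException("Single backslash is prohibited; use two backslashes")
  case Seq(c) => c
  case Seq('\\', 't') => '\t'
  case Seq('\\', 'r') => '\r'
  case Seq('\\', 'b') => '\b'
  case Seq('\\', 'f') => '\f'
  case Seq('\\', '\"') => '\"'  // in case the user changes the quote char
  case Seq('\\', '\'') => '\''
  case Seq('\\', '\\') => '\\'
  case _ =>
    throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
}
```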



---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223726757
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,22 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
+      case Seq(c) => c
--- End diff --

I'm missing why we had to switch up the case statement like this. I get 
that we need to cover more cases, but there was duplication and now there is a 
bit more. What about ...

```
str.length match {
  case 0 => // error
  case 1 => str(0)
  case 2 if str(0) == '\\' =>
str(1) match {
  case c if """trbf"'\""".contains(c) => c
  case 'u' if str == """\u""" => '\0'
  case _ => // error
}
  case _ => // error
}
```


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223724099
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1826,4 +1826,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     val df = spark.read.option("enforceSchema", false).csv(input)
     checkAnswer(df, Row("1", "2"))
   }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()
--- End diff --

Isn't `\` the default escape character? This should be read as the string "abc1" then, and not delimited. It would have to be `\\`, right? I'm not talking about Scala string escaping here, but CSV escaping.

Or is the point that delimiting takes precedence?


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223635442
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,21 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException(s"Delimiter cannot be empty string")
+      case Seq(c) => c
+      case Seq('\\', 't') => '\t'
+      case Seq('\\', 'r') => '\r'
+      case Seq('\\', 'b') => '\b'
+      case Seq('\\', 'f') => '\f'
+      // In case user changes quote char and uses \" as delimiter in options
+      case Seq('\\', '\"') => '\"'
+      case Seq('\\', '\'') => '\''
--- End diff --

Added the case + a test for that


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223623311
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,21 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException(s"Delimiter cannot be empty string")
--- End diff --

removed


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223581611
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,21 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException(s"Delimiter cannot be empty string")
+      case Seq(c) => c
+      case Seq('\\', 't') => '\t'
+      case Seq('\\', 'r') => '\r'
+      case Seq('\\', 'b') => '\b'
+      case Seq('\\', 'f') => '\f'
+      // In case user changes quote char and uses \" as delimiter in options
+      case Seq('\\', '\"') => '\"'
+      case Seq('\\', '\'') => '\''
--- End diff --

Regarding the backslash support, I think we should follow https://docs.oracle.com/javase/tutorial/java/data/characters.html:
`case Seq('\\', '\\') => '\\'`



---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22654#discussion_r223579444
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala ---
@@ -97,23 +97,21 @@ object CSVUtils {
     */
   @throws[IllegalArgumentException]
   def toChar(str: String): Char = {
-    if (str.charAt(0) == '\\') {
-      str.charAt(1)
-      match {
-        case 't' => '\t'
-        case 'r' => '\r'
-        case 'b' => '\b'
-        case 'f' => '\f'
-        case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
-        case '\'' => '\''
-        case 'u' if str == """\u""" => '\u'
-        case _ =>
-          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
-      }
-    } else if (str.length == 1) {
-      str.charAt(0)
-    } else {
-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException(s"Delimiter cannot be empty string")
--- End diff --

Nit: remove `s`


---




[GitHub] spark pull request #22654: [SPARK-25660][SQL] Fix for the backward slash as ...

2018-10-06 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22654

[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter

## What changes were proposed in this pull request?

The PR addresses the exception raised on accessing chars out of the delimiter string. In particular, using the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`:
```Scala
String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)
```
because `str.charAt(1)` tries to access a char beyond the end of `str` in `CSVUtils.toChar`.
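The crash is easy to reproduce in isolation; this hypothetical snippet mirrors the failing `charAt(1)` call:

```scala
// "\\" is the one-character string containing a backslash, so charAt(1)
// is out of range. The exception message text varies by JDK version.
val delimiter = "\\"
try {
  delimiter.charAt(1)
} catch {
  case e: StringIndexOutOfBoundsException =>
    println(s"caught: ${e.getClass.getSimpleName}")
}
```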

## How was this patch tested?

Added tests for the empty string and a string containing the backward slash to `CSVUtilsSuite`. Besides that, I added an end-to-end test to check how the backward slash is handled when reading a CSV string that contains it.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 csv-slash-delim

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22654.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22654


commit dd16ca349dd20626b6a8ae00f6dd445c269104b8
Author: Maxim Gekk 
Date:   2018-10-06T09:12:33Z

Test for backward slash as the delimiter

commit 7bf453a5a5ddedda4319d1146b8c125fa0d4
Author: Maxim Gekk 
Date:   2018-10-06T09:55:15Z

Bug fix + tests




---
