[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22048


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-04 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214867601
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

Do I need to merge the above commits into one commit,
or add another new commit,
or change the last commit message?
@srowen


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214862749
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

I think this is fine. Just copy a summary of your comments here into the comments in the code. Yes, this has nothing to do with UTF-8 encoding directly; you are really matching UCS-2, i.e. 16-bit char values.
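(As a quick illustration, not from the PR: String-level regex matching operates on 16-bit UTF-16 code units, so a supplementary character counts as two.)

```
// U+1F600 is one code point but two UTF-16 code units (a surrogate pair),
// so any per-Char width logic is effectively a BMP-only approximation.
val s = "\uD83D\uDE00"
println(s.length)                       // 2 code units
println(s.codePointCount(0, s.length))  // 1 code point
```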


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-03 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214778257
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

> Can you describe them there and put a reference to a public Unicode document?

This is a regular expression match over Unicode code points, regardless of the specific byte encoding.
For example, the following string is decoded from GBK bytes rather than UTF-8, and the match still works:
```
val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")

println(s1) // 中国

val fullWidthRegex = ("""[""" +
  // scalastyle:off nonascii
  """\u1100-\u115F""" +
  """\u2E80-\uA4CF""" +
  """\uAC00-\uD7A3""" +
  """\uF900-\uFAFF""" +
  """\uFE10-\uFE19""" +
  """\uFE30-\uFE6F""" +
  """\uFF00-\uFF60""" +
  """\uFFE0-\uFFE6""" +
  // scalastyle:on nonascii
  """]""").r

println(fullWidthRegex.findAllIn(s1).size) // 2
```
This regular expression was obtained experimentally under a specific font.
I don't quite understand what you are asking me to do.


> How about the additional overhead of calling `showString` with this patch, as compared to `showString` without it?

I tested a Dataset of 100 rows with two columns: one column is the index (0-99) and the other is a random string of 100 characters, then called the showString display for each version.
The original showString method (without this patch) took about 42 ms; with this patch it took about 46 ms, so roughly 10% slower.
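(For illustration only, a minimal sketch of that comparison; the harness and column names are assumptions, and `show` is used since `showString` is `private[sql]`.)

```
import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().master("local[*]").appName("showBench").getOrCreate()
import spark.implicits._

// 100 rows: an index column and a random 100-character string column.
val df = (0 until 100)
  .map(i => (i, Random.alphanumeric.take(100).mkString))
  .toDF("idx", "str")
df.cache().count()  // materialize first so the timing covers mostly formatting

val start = System.nanoTime()
df.show(100, truncate = false)  // show() delegates to showString internally
println(s"show took ${(System.nanoTime() - start) / 1e6} ms")
```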


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214626069
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

> I looked at all the 0x0000-0xffff characters (Unicode) and displayed them under Xshell, then found all the full-width characters and derived the regular expression.

Can you describe them there and put a reference to a public Unicode document?
See the comment in `UTF8String`:

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L65

> It takes about 49 milliseconds to match all 1000 strings.

How about the additional overhead of calling `showString` with this patch, as compared to `showString` without it?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-03 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214614936
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

- How did you get this regex list? Any reference? It sounds like this should be a general problem.

I looked at all the 0x0000-0xffff characters (Unicode) and displayed them under Xshell, then found all the full-width characters and derived the regular expression.


- What is the performance impact?

I generated 1000 strings, each consisting of 1000 characters drawn randomly from the Unicode range 0x0000-0xffff (1 million characters in total).
Then I used this regular expression to find the full-width characters in these strings.
I ran 100 rounds and averaged the results: it takes about 49 milliseconds to match all 1000 strings.
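(A minimal sketch of that micro-benchmark, assuming characters drawn uniformly from the 16-bit range 0x0000-0xffff; lone surrogates can occur, but the character class simply never matches them.)

```
import scala.util.Random

val fullWidthRegex = ("""[""" +
  """\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF""" +
  """\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6""" +
  """]""").r

// 1000 strings x 1000 random 16-bit chars = 1 million characters.
val strings = Seq.fill(1000)(
  Array.fill(1000)(Random.nextInt(0x10000).toChar).mkString)

val rounds = 100
val start = System.nanoTime()
for (_ <- 0 until rounds; s <- strings) fullWidthRegex.findAllIn(s).size
println(s"avg per round: ${(System.nanoTime() - start) / 1e6 / rounds} ms")
```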

@gatorsmile 


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-31 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214490659
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

A general question:
- How did you get this regex list? Any reference? It sounds like this should be a general problem.
- What is the performance impact?

Can you answer these and post the answers in the PR description?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213888028
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
 
assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) ===
   "UtilsSuite$MalformedClassObject$MalformedClass")
   }
+
+  test("stringHalfWidth") {
+assert(Utils.stringHalfWidth(null) == 0)
+assert(Utils.stringHalfWidth("") == 0)
+assert(Utils.stringHalfWidth("ab c") == 4)
+assert(Utils.stringHalfWidth("1098") == 4)
+assert(Utils.stringHalfWidth("mø") == 2)
+assert(Utils.stringHalfWidth("γύρ") == 3)
--- End diff --

I found similar code in the `org.apache.commons.lang3.StringUtils` class, so I added the same test.
Line 6934 of StringUtils has this comment:
`* StringUtils.isNumeric("\u0967\u0968\u0969") = true`


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213884655
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -301,16 +301,16 @@ class Dataset[T] private[sql](
   // Compute the width of each column
   for (row <- rows) {
 for ((cell, i) <- row.zipWithIndex) {
-  colWidths(i) = math.max(colWidths(i), cell.length)
+  colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell))
 }
   }
 
   val paddedRows = rows.map { row =>
 row.zipWithIndex.map { case (cell, i) =>
   if (truncate > 0) {
-StringUtils.leftPad(cell, colWidths(i))
--- End diff --

OK that's fine.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213883324
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private lazy val fullWidthRegex = ("""[""" +
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
+"""]""").r
+  /**
+   * Return the number of half width of a string
+   * A full width character occupies two half widths
+   */
--- End diff --

OK, thanks.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213882803
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -301,16 +301,16 @@ class Dataset[T] private[sql](
   // Compute the width of each column
   for (row <- rows) {
 for ((cell, i) <- row.zipWithIndex) {
-  colWidths(i) = math.max(colWidths(i), cell.length)
+  colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell))
 }
   }
 
   val paddedRows = rows.map { row =>
 row.zipWithIndex.map { case (cell, i) =>
   if (truncate > 0) {
-StringUtils.leftPad(cell, colWidths(i))
--- End diff --

This method pads the string to a specified total length in characters, but what we need here is to pad the string out to a given display width.
To use this function, the code would have to be:
`StringUtils.leftPad(cell, colWidths(i) - Utils.stringHalfWidth(cell) + cell.length)`
which I think is harder to understand.
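(To illustrate the mismatch, a small sketch; the sample string is an assumption.)

```
import org.apache.commons.lang3.StringUtils

val cell = "中国"   // 2 chars, but 4 half-widths on screen
val target = 6      // desired display width of the column

// leftPad counts chars, not display width, so the pad size must be
// corrected by the display-width/char-count difference: 6 - 4 + 2 = 4.
val halfWidth = 4   // what Utils.stringHalfWidth(cell) returns for "中国"
val padded = StringUtils.leftPad(cell, target - halfWidth + cell.length)
println(s"[$padded]")  // [  中国] -> occupies 6 half-widths
```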



---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213880735
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
 
assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) ===
   "UtilsSuite$MalformedClassObject$MalformedClass")
   }
+
+  test("stringHalfWidth") {
+assert(Utils.stringHalfWidth(null) == 0)
+assert(Utils.stringHalfWidth("") == 0)
+assert(Utils.stringHalfWidth("ab c") == 4)
+assert(Utils.stringHalfWidth("1098") == 4)
+assert(Utils.stringHalfWidth("mø") == 2)
+assert(Utils.stringHalfWidth("γύρ") == 3)
--- End diff --

Don't change it. There are Unicode characters elsewhere in the source, and 
it's clearer this way.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213879659
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
 
assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) ===
   "UtilsSuite$MalformedClassObject$MalformedClass")
   }
+
+  test("stringHalfWidth") {
+assert(Utils.stringHalfWidth(null) == 0)
+assert(Utils.stringHalfWidth("") == 0)
+assert(Utils.stringHalfWidth("ab c") == 4)
+assert(Utils.stringHalfWidth("1098") == 4)
+assert(Utils.stringHalfWidth("mø") == 2)
+assert(Utils.stringHalfWidth("γύρ") == 3)
--- End diff --

Do you mean that the strings here need to be converted to Unicode escape form?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213869119
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private lazy val fullWidthRegex = ("""[""" +
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
+"""]""").r
+  /**
+   * Return the number of half width of a string
+   * A full width character occupies two half widths
+   */
--- End diff --

How about this?
```
  /**
   * Return the number of half widths in a given string. Note that a full width character
   * occupies two half widths.
   */
```


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213868352
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private lazy val fullWidthRegex = ("""[""" +
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
+"""]""").r
--- End diff --

super nit: add blank line


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213699101
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private lazy val fullWidthRegex = ("""[""" +
--- End diff --

Doesn't need to be lazy


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213700228
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -301,16 +301,16 @@ class Dataset[T] private[sql](
   // Compute the width of each column
   for (row <- rows) {
 for ((cell, i) <- row.zipWithIndex) {
-  colWidths(i) = math.max(colWidths(i), cell.length)
+  colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell))
 }
   }
 
   val paddedRows = rows.map { row =>
 row.zipWithIndex.map { case (cell, i) =>
   if (truncate > 0) {
-StringUtils.leftPad(cell, colWidths(i))
--- End diff --

Oh why not use this method?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213699899
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
 
assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) ===
   "UtilsSuite$MalformedClassObject$MalformedClass")
   }
+
+  test("stringHalfWidth") {
+assert(Utils.stringHalfWidth(null) == 0)
+assert(Utils.stringHalfWidth("") == 0)
+assert(Utils.stringHalfWidth("ab c") == 4)
+assert(Utils.stringHalfWidth("1098") == 4)
+assert(Utils.stringHalfWidth("mø") == 2)
+assert(Utils.stringHalfWidth("γύρ") == 3)
--- End diff --

I think it is OK to type Unicode chars into source files, because git and Scala ought to be set up to encode and read the source as UTF-8. I have avoided it in the past and used Unicode escapes (but put the actual string in a comment) to be safe. I don't feel strongly; I think we have other instances of Unicode in the source. Just a comment.
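(For reference, a sketch of that escape-plus-comment style, reusing values from the tests above:)

```
// Unicode escapes keep the source ASCII-only; the comments show the
// actual characters for readability.
assert(Utils.stringHalfWidth("\u006D\u00F8") == 2)        // "mø"
assert(Utils.stringHalfWidth("\u03B3\u03CD\u03C1") == 3)  // "γύρ"
```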


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213699025
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private lazy val fullWidthRegex = ("""[""" +
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
+"""]""").r
+  /**
+   * Return the number of half width of a string
+   * A full width character occupies two half widths
+   */
+  def stringHalfWidth(str: String): Int = {
+if(str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
--- End diff --

Nit: space before if


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-29 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213608404
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,29 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+// Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
+// The number of half width of a string
+def stringHalfWidth = (str: String) => {
+  str.length + fullWidthRegex.findAllIn(str).size
+}
--- End diff --

Moved it into `util.Utils`. 


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213537923
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
 if (!vertical) {
   // Initialise the width of each column to a minimum value
   val colWidths = Array.fill(numCols)(minimumColWidth)
 
   // Compute the width of each column
   for (row <- rows) {
 for ((cell, i) <- row.zipWithIndex) {
-  colWidths(i) = math.max(colWidths(i), cell.length)
+  colWidths(i) = math.max(colWidths(i), cell.length + fullWidthRegex.findAllIn(cell).size)
--- End diff --

I committed a new version. Please take a look and see whether it is appropriate.
@srowen


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213537514
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,29 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+// Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
--- End diff --

This line goes over the 100-character line-length limit.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213537463
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,29 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+// Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
+// The number of half width of a string
+def stringHalfWidth = (str: String) => {
+  str.length + fullWidthRegex.findAllIn(str).size
+}
--- End diff --

better to add tests for the helper function, too.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213537424
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,29 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+// Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
+// The number of half width of a string
+def stringHalfWidth = (str: String) => {
+  str.length + fullWidthRegex.findAllIn(str).size
+}
--- End diff --

better to move this method into `util.Utils` or something as a helper 
function?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213525456
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
--- End diff --

OK, not worth optimizing now.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213525417
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
--- End diff --

OK, thanks.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213525240
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
--- End diff --

I generated 1000 strings, each consisting of 1000 characters drawn randomly from the Unicode range 0x0000-0xffff (1 million characters in total).
Then I used this regular expression to find the full-width characters in these strings.
I ran 100 rounds and averaged the results: it takes about 49 milliseconds to match all 1000 strings.


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213384744
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
--- End diff --

Very small nit that might fail scalastyle -- space after comment slashes


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213385143
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
 if (!vertical) {
   // Initialise the width of each column to a minimum value
   val colWidths = Array.fill(numCols)(minimumColWidth)
 
   // Compute the width of each column
   for (row <- rows) {
 for ((cell, i) <- row.zipWithIndex) {
-  colWidths(i) = math.max(colWidths(i), cell.length)
+  colWidths(i) = math.max(colWidths(i), cell.length + fullWidthRegex.findAllIn(cell).size)
--- End diff --

What about a utility method for the `x.length + fullWidthRegex.findAllIn(x).size` expression that recurs here?


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r213384981
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,25 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+//Regular expression matching full width characters
+val fullWidthRegex = 
"""[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
--- End diff --

I wonder if detecting this by running a regex over every character is slow? Probably not meaningfully enough to care about, but I wonder.
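(If it ever mattered, a regex-free sketch of the same check, reusing the ranges from the character class in this diff:)

```
// Count full-width chars with a linear scan instead of a regex.
def stringHalfWidth(s: String): Int =
  s.length + s.count(c =>
    (c >= 0x1100 && c <= 0x115F) || (c >= 0x2E80 && c <= 0xA4CF) ||
    (c >= 0xAC00 && c <= 0xD7A3) || (c >= 0xF900 && c <= 0xFAFF) ||
    (c >= 0xFE10 && c <= 0xFE19) || (c >= 0xFE30 && c <= 0xFE6F) ||
    (c >= 0xFF00 && c <= 0xFF60) || (c >= 0xFFE0 && c <= 0xFFE6))
```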


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-24 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r212557191
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,24 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+val regex = """[^\x00-\u2e39]""".r
--- End diff --

I agree with @srowen's comment. To make it clear, could you please add 
tests for more characters?
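(As an illustration, a sketch of the kind of extra cases this asks for; the expected values assume the full-width ranges this PR later adopted.)

```
assert(Utils.stringHalfWidth("中国") == 4)    // CJK ideographs: full width
assert(Utils.stringHalfWidth("한국어") == 6)   // Hangul syllables: full width
assert(Utils.stringHalfWidth("ｱｲｳ") == 3)    // halfwidth Katakana stays half width
```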


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-08-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r211105897
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -294,23 +294,24 @@ class Dataset[T] private[sql](
 // We set a minimum column width at '3'
 val minimumColWidth = 3
 
+val regex = """[^\x00-\u2e39]""".r
--- End diff --

This could use a comment, and a slightly better name for the variable? I also wonder if a regex is a little bit slow for scanning every character.

However, it's not clear this definition is accurate enough. According to things like
https://stackoverflow.com/questions/13505075/analyzing-full-width-or-half-width-character-in-java
we're really looking for the concept of "fullwidth" East Asian characters. The answer there provides a somewhat more precise definition, though others say you need something like icu4j for a proper definition.

Maybe at least adopt the alternative proposed in the SO answer? It's not necessary to be perfect here, as this is a cosmetic issue.
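(A hedged sketch of the Unicode-block heuristic along the lines of that SO answer; the block list is an approximation rather than what this PR ultimately adopted, and icu4j's East Asian Width property would be more precise.)

```
import java.lang.Character.UnicodeBlock

// Per-char heuristic. Note that HALFWIDTH_AND_FULLWIDTH_FORMS mixes half-
// and full-width characters, which is one reason this is approximate.
def isFullWidth(c: Char): Boolean = {
  val block = UnicodeBlock.of(c)
  block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS ||
    block == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS ||
    block == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION ||
    block == UnicodeBlock.HANGUL_SYLLABLES ||
    block == UnicodeBlock.HIRAGANA ||
    block == UnicodeBlock.KATAKANA ||
    block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
}

println("中国 ab".count(isFullWidth))  // 2
```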


---
