[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22048 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214867601 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff -- Should I squash the commits above into one commit, add another new commit, or change the last commit message? @srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214862749 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff -- I think this is fine. Just copy a summary of your comments here into the comments in the code. Yes, this has nothing to do with UTF-8 encoding directly; you are really matching UCS-2, i.e. 16-bit char values.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214778257 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff --
> Can you describe them there and put a references to a public unicode document?

This is a regular-expression match over Unicode characters, regardless of the original encoding. For example, the following string is decoded from GBK bytes rather than UTF-8, and the match still works:

val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")
println(s1) // 中国
val fullWidthRegex = ("""[""" +
  // scalastyle:off nonascii
  """\u1100-\u115F""" +
  """\u2E80-\uA4CF""" +
  """\uAC00-\uD7A3""" +
  """\uF900-\uFAFF""" +
  """\uFE10-\uFE19""" +
  """\uFE30-\uFE6F""" +
  """\uFF00-\uFF60""" +
  """\uFFE0-\uFFE6""" +
  // scalastyle:on nonascii
  """]""").r
println(fullWidthRegex.findAllIn(s1).size) // 2

This regular expression was obtained experimentally under a specific font. I don't quite understand what you are asking for.

> How about some additional overheads when calling showString as compared to showString w/o this patch?

I tested a Dataset of 100 rows with two columns each: one column is the index (0-99), the other a random string of 100 characters; then I called showString on it. The original showString method (w/o this patch) took about 42 ms, and the patched version took about 46 ms, i.e. roughly 10% slower.
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214626069 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff -- > I looked at all the 0x-0x characters (unicode) and showed them under Xshell, then found all the full width characters. Get the regular expression. Can you describe them there and put a reference to a public Unicode document? See the comment in `UTF8String`; https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L65 > It takes 49 milliseconds to complete matching all 1000 strings. How about some additional overheads when calling `showString` as compared to `showString` w/o this patch?
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214614936 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff -- - How to get this Regex list? Any reference? It sounds like this should be a general problem I displayed all the 0x-0x (Unicode) characters under Xshell, identified all the full width characters, and derived the regular expression from them. - What is the performance impact? I generated 1000 strings, each consisting of 1000 random Unicode characters in 0x-0x (a total of 1 million characters), then used this regular expression to find the full width characters in these strings. Averaged over 100 rounds, it takes 49 milliseconds to match all 1000 strings. @gatorsmile
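The micro-benchmark described above can be sketched as a standalone snippet. This is a hypothetical reconstruction, not the author's actual harness: the `RegexBench` name, the fixed seed, and the single warm-up pass are assumptions; only the character ranges are taken from the PR diff.

```scala
import scala.util.Random

object RegexBench {
  // Full width ranges copied from the PR diff.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF""" +
    """\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""").r

  // Count full width chars across all strings with the regex under test.
  def countFullWidth(strings: Seq[String]): Int =
    strings.map(s => fullWidthRegex.findAllIn(s).size).sum

  // 1000 strings x 1000 random BMP chars, as in the comment above.
  // Returns (match count, elapsed milliseconds for one timed pass).
  def run(): (Int, Double) = {
    val rnd = new Random(42)
    val strings = Seq.fill(1000)(
      new String(Array.fill(1000)(rnd.nextInt(0x10000).toChar)))
    countFullWidth(strings) // warm-up pass
    val t0 = System.nanoTime()
    val n = countFullWidth(strings)
    (n, (System.nanoTime() - t0) / 1e6)
  }
}
```

The absolute timing will of course vary by machine and JVM; the point is only that a single regex pass over a million characters is on the order of tens of milliseconds.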
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214490659 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff -- A general question. - How to get this Regex list? Any reference? It sounds like this should be a general problem. - What is the performance impact? Can you answer these and post them in the PR description?
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213888028 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) === "UtilsSuite$MalformedClassObject$MalformedClass") } + + test("stringHalfWidth") { +assert(Utils.stringHalfWidth(null) == 0) +assert(Utils.stringHalfWidth("") == 0) +assert(Utils.stringHalfWidth("ab c") == 4) +assert(Utils.stringHalfWidth("1098") == 4) +assert(Utils.stringHalfWidth("mø") == 2) +assert(Utils.stringHalfWidth("γύρ") == 3) --- End diff -- I found similar code in the `org.apache.commons.lang3.StringUtils` class, so I added the same kind of test. Line 6934 of StringUtils has this comment: `* StringUtils.isNumeric("\u0967\u0968\u0969") = true`
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213884655 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -301,16 +301,16 @@ class Dataset[T] private[sql]( // Compute the width of each column for (row <- rows) { for ((cell, i) <- row.zipWithIndex) { - colWidths(i) = math.max(colWidths(i), cell.length) + colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell)) } } val paddedRows = rows.map { row => row.zipWithIndex.map { case (cell, i) => if (truncate > 0) { -StringUtils.leftPad(cell, colWidths(i)) --- End diff -- OK that's fine.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213883324 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private lazy val fullWidthRegex = ("""[""" + +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + +"""]""").r + /** + * Return the number of half width of a string + * A full width character occupies two half widths + */ --- End diff -- OK, thanks
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213882803 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -301,16 +301,16 @@ class Dataset[T] private[sql]( // Compute the width of each column for (row <- rows) { for ((cell, i) <- row.zipWithIndex) { - colWidths(i) = math.max(colWidths(i), cell.length) + colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell)) } } val paddedRows = rows.map { row => row.zipWithIndex.map { case (cell, i) => if (truncate > 0) { -StringUtils.leftPad(cell, colWidths(i)) --- End diff -- This method pads the string to a specified total character count, but what we need here is to pad the string based on its display width. To use this function, the code would have to be: `StringUtils.leftPad(cell, colWidths(i) - Utils.stringHalfWidth(cell) + cell.length)` which I think is harder to understand.
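The distinction above (padding to a total character count vs. padding by display width) can be illustrated with a small sketch. `leftPadByWidth` and the enclosing `Padding` object are hypothetical names for illustration, not part of the PR; `stringHalfWidth` mirrors the helper added in the PR diff.

```scala
object Padding {
  // Full width ranges copied from the PR diff.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF""" +
    """\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""").r

  // Mirrors Utils.stringHalfWidth: full width chars count as two columns.
  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size

  // Left-pad `cell` so its *display* width (not char count) reaches `width`.
  // Equivalent to StringUtils.leftPad(cell,
  //   width - stringHalfWidth(cell) + cell.length) from the comment above.
  def leftPadByWidth(cell: String, width: Int): String =
    " " * math.max(0, width - stringHalfWidth(cell)) + cell
}
```

For a pure-ASCII cell this behaves exactly like `StringUtils.leftPad`; the two only diverge when the cell contains full width characters, which is the bug the PR fixes.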
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213880735 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) === "UtilsSuite$MalformedClassObject$MalformedClass") } + + test("stringHalfWidth") { +assert(Utils.stringHalfWidth(null) == 0) +assert(Utils.stringHalfWidth("") == 0) +assert(Utils.stringHalfWidth("ab c") == 4) +assert(Utils.stringHalfWidth("1098") == 4) +assert(Utils.stringHalfWidth("mø") == 2) +assert(Utils.stringHalfWidth("γύρ") == 3) --- End diff -- Don't change it. There are Unicode characters elsewhere in the source, and it's clearer this way.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213879659 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) === "UtilsSuite$MalformedClassObject$MalformedClass") } + + test("stringHalfWidth") { +assert(Utils.stringHalfWidth(null) == 0) +assert(Utils.stringHalfWidth("") == 0) +assert(Utils.stringHalfWidth("ab c") == 4) +assert(Utils.stringHalfWidth("1098") == 4) +assert(Utils.stringHalfWidth("mø") == 2) +assert(Utils.stringHalfWidth("γύρ") == 3) --- End diff -- Do you mean that the string here needs to be converted to a unicode escape form?
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213869119 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private lazy val fullWidthRegex = ("""[""" + +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + +"""]""").r + /** + * Return the number of half width of a string + * A full width character occupies two half widths + */ --- End diff -- How about this? ``` /** * Return the number of half widths in a given string. Note that a full width character * occupies two half widths. */ ```
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213868352 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private lazy val fullWidthRegex = ("""[""" + +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + +"""]""").r --- End diff -- super nit: add blank line
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213699101 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private lazy val fullWidthRegex = ("""[""" + --- End diff -- Doesn't need to be lazy
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213700228 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -301,16 +301,16 @@ class Dataset[T] private[sql]( // Compute the width of each column for (row <- rows) { for ((cell, i) <- row.zipWithIndex) { - colWidths(i) = math.max(colWidths(i), cell.length) + colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell)) } } val paddedRows = rows.map { row => row.zipWithIndex.map { case (cell, i) => if (truncate > 0) { -StringUtils.leftPad(cell, colWidths(i)) --- End diff -- Oh why not use this method?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213699899 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -1184,6 +1184,25 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) === "UtilsSuite$MalformedClassObject$MalformedClass") } + + test("stringHalfWidth") { +assert(Utils.stringHalfWidth(null) == 0) +assert(Utils.stringHalfWidth("") == 0) +assert(Utils.stringHalfWidth("ab c") == 4) +assert(Utils.stringHalfWidth("1098") == 4) +assert(Utils.stringHalfWidth("mø") == 2) +assert(Utils.stringHalfWidth("γύρ") == 3) --- End diff -- I think it is OK to type Unicode chars into source files because git and Scala ought to be set up to encode and read the source as UTF8. I have avoided it in the past and used Unicode escapes (but put actual string in comment) to be safe. I don't feel strongly. I think we have other instances of Unicode in the source. Just a comment.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213699025 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private lazy val fullWidthRegex = ("""[""" + +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + +"""]""").r + /** + * Return the number of half width of a string + * A full width character occupies two half widths + */ + def stringHalfWidth(str: String): Int = { +if(str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size --- End diff -- Nit: space before if
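For reference, the helper under review can be exercised standalone roughly like this. This is a sketch, not the PR's code verbatim: the `StringWidth` object is made up for illustration, the scalastyle directives are dropped, and the `if (str == null)` spacing nit from the comment above is applied.

```scala
object StringWidth {
  // Ranges of full width characters, copied from the PR diff. Each matched
  // character is assumed to occupy two terminal columns.
  private val fullWidthRegex = ("""[""" +
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    """]""").r

  // Number of half widths in a string; a full width char counts as two.
  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
}
```

Note that each match simply adds one extra column on top of `str.length`, which is why a string of n full width characters reports 2n half widths.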
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213608404 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,29 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +// Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r +// The number of half width of a string +def stringHalfWidth = (str: String) => { + str.length + fullWidthRegex.findAllIn(str).size +} --- End diff -- Moved it into `util.Utils`.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213537923 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r if (!vertical) { // Initialise the width of each column to a minimum value val colWidths = Array.fill(numCols)(minimumColWidth) // Compute the width of each column for (row <- rows) { for ((cell, i) <- row.zipWithIndex) { - colWidths(i) = math.max(colWidths(i), cell.length) + colWidths(i) = math.max(colWidths(i), cell.length + fullWidthRegex.findAllIn(cell).size) --- End diff -- I committed a new version. Please take a look and see if it is appropriate. @srowen
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213537514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,29 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +// Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r --- End diff -- This line goes over the limit, 100.
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213537463 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,29 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +// Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r +// The number of half width of a string +def stringHalfWidth = (str: String) => { + str.length + fullWidthRegex.findAllIn(str).size +} --- End diff -- better to add tests for the helper function, too.
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213537424 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,29 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +// Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r +// The number of half width of a string +def stringHalfWidth = (str: String) => { + str.length + fullWidthRegex.findAllIn(str).size +} --- End diff -- better to move this method into `util.Utils` or something as a helper function?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213525456 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r --- End diff -- OK, not worth optimizing now.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213525417 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters --- End diff -- OK, thanks.
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213525240 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r --- End diff -- I generated 1000 strings, each consisting of 1000 random Unicode characters in 0x-0x (a total of 1 million characters), then used this regular expression to find the full width characters in these strings. Averaged over 100 rounds, it takes 49 milliseconds to match all 1000 strings.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213384744 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters --- End diff -- Very small nit that might fail scalastyle -- space after comment slashes
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213385143 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r if (!vertical) { // Initialise the width of each column to a minimum value val colWidths = Array.fill(numCols)(minimumColWidth) // Compute the width of each column for (row <- rows) { for ((cell, i) <- row.zipWithIndex) { - colWidths(i) = math.max(colWidths(i), cell.length) + colWidths(i) = math.max(colWidths(i), cell.length + fullWidthRegex.findAllIn(cell).size) --- End diff -- What about a utility method for the `x.length + fullWidthRegex.findAllIn(x).size` expression that recurs here?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r213384981 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,25 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +//Regular expression matching full width characters +val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r --- End diff -- I wonder if detecting this using a regex for each character is slow? probably not meaningfully enough to care about, but I wonder.
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r212557191 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,24 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +val regex = """[^\x00-\u2e39]""".r --- End diff -- I agree with @srowen's comment. To make it clear, could you please add tests for more characters?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r211105897 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -294,23 +294,24 @@ class Dataset[T] private[sql]( // We set a minimum column width at '3' val minimumColWidth = 3 +val regex = """[^\x00-\u2e39]""".r --- End diff -- This could use a comment and slightly better name for the variable? I also wonder if a regex is a little bit slow for scanning every character. However it's not clear this definition is accurate enough. According to things like https://stackoverflow.com/questions/13505075/analyzing-full-width-or-half-width-character-in-java we're really looking for the concept of "fullwidth" East Asian characters. The answer there provides a somewhat more precise definition, though others say you need something like icu4j for a proper definition. Maybe at least adopt the alternative proposed in the SO answer? It's not necessary to be perfect here, as it's a cosmetic issue.