[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720201#comment-16720201 ]

Keiji Yoshida commented on SPARK-26335:
---

https://github.com/apache/spark/pull/23307#issuecomment-446978389

> Add a property for Dataset#show not to care about wide characters when padding them
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-26335
>                 URL: https://issues.apache.org/jira/browse/SPARK-26335
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Keiji Yoshida
>            Priority: Major
>         Attachments: Screen Shot 2018-12-11 at 17.53.54.png
>
> h2. Issue
> https://issues.apache.org/jira/browse/SPARK-25108 made Dataset#show take wide characters into account when padding them. That makes a result of Dataset#show easier for humans to read. On the other hand, it makes it impossible for programs to parse a result of Dataset#show, because each cell's length can differ from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). In this application, the result of Dataset#show from a Scala or Python process is parsed to visualize it as an HTML table. (A screenshot of OASIS is attached to this ticket as "Screen Shot 2018-12-11 at 17.53.54.png".)
> h2. Solution
> Add a property for Dataset#show not to take wide characters into account when padding them.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
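The mismatch the issue describes (a cell's character length differing from its header's after width-aware padding) can be seen in a minimal sketch. The column names and values here are made up for illustration; this is not Spark's code:

```python
# Why width-aware padding breaks offset-based parsers: a full-width string
# such as "测试" occupies four display columns but is only two characters
# long, so a padded row line ends up shorter (in characters) than the header.
header_line = "|name|val|"   # "name" spans 4 half-width columns
row_line = "|测试|  7|"      # "测试" also spans 4 display columns, but len() == 2

print(len(header_line))  # 10
print(len(row_line))     # 8
```

A parser that slices each row at the header's character offsets therefore reads the wrong column boundaries once full-width characters appear.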
[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720199#comment-16720199 ]

ASF GitHub Bot commented on SPARK-26335:

kjmrknsn closed pull request #23307: [SPARK-26335][SQL] Add a property for Dataset#show not to care about wide characters when padding them
URL: https://github.com/apache/spark/pull/23307

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala
index b4ea1ee950217..49c721873377b 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2822,6 +2822,19 @@ private[spark] object Utils extends Logging {
     if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
   }
 
+  /**
+   * Return a width of a given string.
+   *
+   * @param str a string
+   * @param halfWidth If it is set to true, the number of half widths of a given string will be
+   *                  returned. Otherwise, the number of characters of a given string will be
+   *                  returned.
+   * @return a width of a given string
+   */
+  def stringWidth(str: String, halfWidth: Boolean): Int = {
+    if (str == null) 0 else if (halfWidth) stringHalfWidth(str) else str.length
+  }
+
   def sanitizeDirName(str: String): String = {
     str.replaceAll("[ :/]", "-").replaceAll("[.${}'\"]", "_").toLowerCase(Locale.ROOT)
   }
diff --git a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
index b2ff1cce3eb0b..ea6c72d553543 100644
--- a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
@@ -1193,6 +1193,44 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
     // scalastyle:on nonascii
   }
 
+  test("stringWidth") {
+    // scalastyle:off nonascii
+    assert(Utils.stringWidth(null, false) == 0)
+    assert(Utils.stringWidth(null, true) == 0)
+    assert(Utils.stringWidth("", false) == 0)
+    assert(Utils.stringWidth("", true) == 0)
+    assert(Utils.stringWidth("ab c", false) == 4)
+    assert(Utils.stringWidth("ab c", true) == 4)
+    assert(Utils.stringWidth("1098", false) == 4)
+    assert(Utils.stringWidth("1098", true) == 4)
+    assert(Utils.stringWidth("mø", false) == 2)
+    assert(Utils.stringWidth("mø", true) == 2)
+    assert(Utils.stringWidth("γύρ", false) == 3)
+    assert(Utils.stringWidth("γύρ", true) == 3)
+    assert(Utils.stringWidth("pê", false) == 2)
+    assert(Utils.stringWidth("pê", true) == 2)
+    assert(Utils.stringWidth("ー", false) == 1)
+    assert(Utils.stringWidth("ー", true) == 2)
+    assert(Utils.stringWidth("测", false) == 1)
+    assert(Utils.stringWidth("测", true) == 2)
+    assert(Utils.stringWidth("か", false) == 1)
+    assert(Utils.stringWidth("か", true) == 2)
+    assert(Utils.stringWidth("걸", false) == 1)
+    assert(Utils.stringWidth("걸", true) == 2)
+    assert(Utils.stringWidth("à", false) == 1)
+    assert(Utils.stringWidth("à", true) == 1)
+    assert(Utils.stringWidth("焼", false) == 1)
+    assert(Utils.stringWidth("焼", true) == 2)
+    assert(Utils.stringWidth("羍む", false) == 2)
+    assert(Utils.stringWidth("羍む", true) == 4)
+    assert(Utils.stringWidth("뺭ᾘ", false) == 2)
+    assert(Utils.stringWidth("뺭ᾘ", true) == 3)
+    assert(Utils.stringWidth("\u0967\u0968\u0969", false) == 3)
+    assert(Utils.stringWidth("\u0967\u0968\u0969", true) == 3)
+    // scalastyle:on nonascii
+  }
+
   test("trimExceptCRLF standalone") {
     val crlfSet = Set("\r", "\n")
     val nonPrintableButCRLF = (0 to 32).map(_.toChar.toString).toSet -- crlfSet
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 86e068bf632bd..3b4351560c061 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1635,6 +1635,18 @@ object SQLConf {
       "java.time.* packages are used for the same purpose.")
     .booleanConf
     .createWithDefault(false)
+
+  val DATASET_SHOW_HANDLE_FULL_WIDTH_CHARACTERS =
+    buildConf("spark.sql.dataset.show.handleFullWidthCharacters")
+      .doc("If it is set to true, a width of a full width character will be calculated as two " +
+        "half widths. That makes it easy for humans to view a result of " +
+        "`org.apache.spark.sql.Dataset#show`. On the other hand, that makes it impossible for " +
+        "programs
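The `stringWidth` helper in the diff above counts a full-width character as two half-widths. A rough Python equivalent can be sketched with the standard library's East Asian Width data instead of Spark's `fullWidthRegex` (an assumption on my part; the two classifications may differ for edge-case code points):

```python
import unicodedata

def string_width(s, half_width):
    """Display width of s. With half_width=True, characters whose East Asian
    Width property is Fullwidth ('F') or Wide ('W') count as two columns;
    otherwise every character counts as one (plain len)."""
    if s is None:
        return 0
    if not half_width:
        return len(s)
    return sum(2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
               for ch in s)

print(string_width("测", True))   # a CJK character spans two columns -> 2
print(string_width("测", False))  # character count only -> 1
```

This mirrors the shape of the proposed API: the boolean flag selects between visual width (for pretty terminal output) and character count (for predictable, parseable output).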
[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719895#comment-16719895 ]

Keiji Yoshida commented on SPARK-26335:
---

> Hm, I don't think Dataset#show is supposed to be used to be parsed. It's rather for just showing a pretty print. To make an HTML table, I think you should use collect or copy some methods in Spark into your project to make it pretty.

Thanks for your comment. I could implement an API for building an HTML table from a Dataset on the WebUI, but then end users could not run the same code on their terminal console and on the WebUI: for example, they would write `spark.sql(...).show()` on their terminal console to print a dataset, but would have to write, say, `html(spark.sql(...))` on the WebUI. It would be much more useful for users to be able to run the same code in both places.
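The workflow discussed in this comment (parsing `Dataset#show` output into an HTML table) might look roughly like the following sketch. `show_to_html` is a hypothetical name, not OASIS's actual code; note that splitting on the `|` separators, rather than slicing at fixed offsets, sidesteps the padding-width problem:

```python
import html

def show_to_html(show_text):
    """Convert the ASCII table printed by Dataset#show into an HTML table.
    Border lines (+---+---+) are skipped; data lines are split on '|'."""
    lines = [l for l in show_text.strip().splitlines()
             if not l.startswith("+")]
    # split("|") yields empty strings for the outer pipes; drop them with [1:-1]
    rows = [[c.strip() for c in l.split("|")[1:-1]] for l in lines]
    header, body = rows[0], rows[1:]
    parts = ["<table>",
             "<tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in header) + "</tr>"]
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)

sample = """
+---+-----+
| id| name|
+---+-----+
|  1|alice|
+---+-----+
"""
print(show_to_html(sample))
```

This separator-based approach still breaks if a cell value itself contains `|`, which is one reason the discussion leans toward collect-based rendering instead of parsing the pretty-printed output.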
[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719870#comment-16719870 ]

ASF GitHub Bot commented on SPARK-26335:

kjmrknsn opened a new pull request #23307: [SPARK-26335][SQL] Add a property for Dataset#show not to care about wide characters when padding them
URL: https://github.com/apache/spark/pull/23307

## What changes were proposed in this pull request?

### Issue

[SPARK-25108](https://issues.apache.org/jira/browse/SPARK-25108) made `Dataset#show` take wide characters into account when padding them. That makes a result of `Dataset#show` easier for humans to read. On the other hand, it makes it impossible for programs to parse a result of `Dataset#show`, because each cell's length can differ from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named [OASIS](https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark). In this application, the result of `Dataset#show` from a Scala or Python process is parsed to visualize it as an HTML table, as shown in this screenshot: https://user-images.githubusercontent.com/31149688/49923017-9e3c6180-fef5-11e8-970b-077bed46cdee.png

### Solution

Add the `spark.sql.dataset.show.handleFullWidthCharacters` property for `Dataset#show` to control whether wide characters are handled when padding.

## How was this patch tested?

This patch was tested via unit tests.

## Jira Issue

https://issues.apache.org/jira/browse/SPARK-26335

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org