[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them

2018-12-13 Thread Keiji Yoshida (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720201#comment-16720201
 ] 

Keiji Yoshida commented on SPARK-26335:
---

https://github.com/apache/spark/pull/23307#issuecomment-446978389

> Add a property for Dataset#show not to care about wide characters when 
> padding them
> ---
>
>                 Key: SPARK-26335
>                 URL: https://issues.apache.org/jira/browse/SPARK-26335
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Keiji Yoshida
>            Priority: Major
>         Attachments: Screen Shot 2018-12-11 at 17.53.54.png
>
>
> h2. Issue
> https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show take 
> wide characters into account when padding cells. That helps humans read the 
> output of Dataset#show. On the other hand, it makes the output impossible for 
> programs to parse, because a cell's length can differ from its header's 
> length. My company develops and manages a Jupyter/Apache Zeppelin-like 
> visualization tool named "OASIS" 
> ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
> In this application, the output of Dataset#show from a Scala or Python 
> process is parsed and rendered as an HTML table. (A screenshot of OASIS is 
> attached to this ticket as "Screen Shot 2018-12-11 at 17.53.54.png".)
> h2. Solution
> Add a property that lets Dataset#show ignore wide characters when padding 
> cells.
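
The length mismatch described in the issue can be reproduced with a minimal sketch. The object name, the `displayWidth` helper, and the reduced set of full-width ranges are assumptions made for illustration; they are not Spark's actual implementation.

```scala
object WidthMismatchSketch {
  // Assumption: a reduced set of full-width ranges (Hangul Jamo, CJK and kana,
  // Hangul syllables, full-width forms). Spark's own regex may cover more.
  private val fullWidth =
    "[\\u1100-\\u115F\\u2E80-\\uA4CF\\uAC00-\\uD7A3\\uFF00-\\uFF60]".r

  // Display width in half-width columns: each full-width character counts twice.
  def displayWidth(s: String): Int = s.length + fullWidth.findAllIn(s).size

  def main(args: Array[String]): Unit = {
    val header = "name"
    val cell = "かな"
    // Both occupy four terminal columns, so width-aware padding adds no spaces...
    println(displayWidth(header) == displayWidth(cell)) // prints "true"
    // ...yet the strings differ in length, which is what a parser sees.
    println(header.length - cell.length) // prints "2"
  }
}
```

A header and a cell that line up on screen can therefore still have different `String` lengths.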



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them

2018-12-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720199#comment-16720199
 ] 

ASF GitHub Bot commented on SPARK-26335:


kjmrknsn closed pull request #23307: [SPARK-26335][SQL] Add a property for 
Dataset#show not to care about wide characters when padding them
URL: https://github.com/apache/spark/pull/23307
 
 
   

This is a PR from a forked repository. As GitHub hides the original diff
of such pull requests, it is displayed below for the sake of provenance:

diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala
index b4ea1ee950217..49c721873377b 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2822,6 +2822,19 @@ private[spark] object Utils extends Logging {
     if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
   }
 
+  /**
+   * Return a width of a given string.
+   *
+   * @param str a string
+   * @param halfWidth If it is set to true, the number of half widths of a given string will be
+   *                  returned.
+   *                  Otherwise, the number of characters of a given string will be returned.
+   * @return a width of a given string
+   */
+  def stringWidth(str: String, halfWidth: Boolean): Int = {
+    if (str == null) 0 else if (halfWidth) stringHalfWidth(str) else str.length
+  }
+
   def sanitizeDirName(str: String): String = {
     str.replaceAll("[ :/]", "-").replaceAll("[.${}'\"]", "_").toLowerCase(Locale.ROOT)
   }
diff --git a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
index b2ff1cce3eb0b..ea6c72d553543 100644
--- a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
@@ -1193,6 +1193,44 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
     // scalastyle:on nonascii
   }
 
+   test("stringWidth") {
+    // scalastyle:off nonascii
+    assert(Utils.stringWidth(null, false) == 0)
+    assert(Utils.stringWidth(null, true) == 0)
+    assert(Utils.stringWidth("", false) == 0)
+    assert(Utils.stringWidth("", true) == 0)
+    assert(Utils.stringWidth("ab c", false) == 4)
+    assert(Utils.stringWidth("ab c", true) == 4)
+    assert(Utils.stringWidth("1098", false) == 4)
+    assert(Utils.stringWidth("1098", true) == 4)
+    assert(Utils.stringWidth("mø", false) == 2)
+    assert(Utils.stringWidth("mø", true) == 2)
+    assert(Utils.stringWidth("γύρ", false) == 3)
+    assert(Utils.stringWidth("γύρ", true) == 3)
+    assert(Utils.stringWidth("pê", false) == 2)
+    assert(Utils.stringWidth("pê", true) == 2)
+    assert(Utils.stringWidth("ー", false) == 1)
+    assert(Utils.stringWidth("ー", true) == 2)
+    assert(Utils.stringWidth("测", false) == 1)
+    assert(Utils.stringWidth("测", true) == 2)
+    assert(Utils.stringWidth("か", false) == 1)
+    assert(Utils.stringWidth("か", true) == 2)
+    assert(Utils.stringWidth("걸", false) == 1)
+    assert(Utils.stringWidth("걸", true) == 2)
+    assert(Utils.stringWidth("à", false) == 1)
+    assert(Utils.stringWidth("à", true) == 1)
+    assert(Utils.stringWidth("焼", false) == 1)
+    assert(Utils.stringWidth("焼", true) == 2)
+    assert(Utils.stringWidth("羍む", false) == 2)
+    assert(Utils.stringWidth("羍む", true) == 4)
+    assert(Utils.stringWidth("뺭ᾘ", false) == 2)
+    assert(Utils.stringWidth("뺭ᾘ", true) == 3)
+    assert(Utils.stringWidth("\u0967\u0968\u0969", false) == 3)
+    assert(Utils.stringWidth("\u0967\u0968\u0969", true) == 3)
+    // scalastyle:on nonascii
+  }
+
+
   test("trimExceptCRLF standalone") {
     val crlfSet = Set("\r", "\n")
     val nonPrintableButCRLF = (0 to 32).map(_.toChar.toString).toSet -- crlfSet
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 86e068bf632bd..3b4351560c061 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1635,6 +1635,18 @@ object SQLConf {
       "java.time.* packages are used for the same purpose.")
     .booleanConf
     .createWithDefault(false)
+
+  val DATASET_SHOW_HANDLE_FULL_WIDTH_CHARACTERS =
+    buildConf("spark.sql.dataset.show.handleFullWidthCharacters")
+      .doc("If it is set to true, a width of a full width character will be calculated as two " +
+        "half widths. That makes it easy for humans to view a result of " +
+        "`org.apache.spark.sql.Dataset#show`. On the other hand, that makes it impossible for " +
+        "programs 

[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them

2018-12-13 Thread Keiji Yoshida (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719895#comment-16719895
 ] 

Keiji Yoshida commented on SPARK-26335:
---

> Hm, I don't think Dataset#show is meant to be parsed. It's just for 
> pretty-printing. To build an HTML table, I think you should use collect, or 
> copy some methods from Spark into your project to format the output.

Thanks for your comment.

I could implement an API that builds an HTML table from a Dataset in the WebUI, 
but then end users could not use the same code in both their terminal console 
and the WebUI. For example, they would write `spark.sql(...).show()` in a 
terminal console to print a dataset, but would have to write, say, 
`html(spark.sql(...))` in the WebUI. It would be much more useful for users to 
be able to run the same code in both places.
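
To illustrate why width-aware padding matters to a tool that parses `show()` output, here is a minimal sketch of an offset-based parser. The thread does not describe how OASIS actually parses the output, so the object, both helpers, and the sample rows are hypothetical.

```scala
object OffsetParserSketch {
  // Column boundaries are the positions of '+' in a separator line such as "+----+---+".
  def boundaries(separator: String): Seq[Int] =
    separator.zipWithIndex.collect { case ('+', i) => i }

  // Slice a row between consecutive boundaries and trim the padding.
  // Slices that would run past the end of the row are skipped.
  def parseRow(row: String, bounds: Seq[Int]): List[String] =
    bounds.sliding(2).collect {
      case Seq(start, end) if end <= row.length => row.substring(start + 1, end).trim
    }.toList

  def main(args: Array[String]): Unit = {
    val bounds = boundaries("+----+---+")
    // Padded by character count: the row is as long as the separator.
    println(parseRow("|  かな| 10|", bounds))
    // Padded by display width: the row is shorter, so the offsets no longer match.
    println(parseRow("|かな| 10|", bounds))
  }
}
```

With character-count padding the slices line up with the separator; with display-width padding the same offsets cut through the cell delimiters.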




[jira] [Commented] (SPARK-26335) Add a property for Dataset#show not to care about wide characters when padding them

2018-12-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719870#comment-16719870
 ] 

ASF GitHub Bot commented on SPARK-26335:


kjmrknsn opened a new pull request #23307: [SPARK-26335][SQL] Add a property 
for Dataset#show not to care about wide characters when padding them
URL: https://github.com/apache/spark/pull/23307
 
 
   ## What changes were proposed in this pull request?
   
   ### Issue
   [SPARK-25108](https://issues.apache.org/jira/browse/SPARK-25108) made 
`Dataset#show` take wide characters into account when padding cells. That helps 
humans read the output of `Dataset#show`. On the other hand, it makes the output 
impossible for programs to parse, because a cell's length can differ from its 
header's length. My company develops and manages a Jupyter/Apache Zeppelin-like 
visualization tool named 
[OASIS](https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark).
In this application, the output of `Dataset#show` from a Scala or Python process 
is parsed and rendered as an HTML table, as follows: 
   
   https://user-images.githubusercontent.com/31149688/49923017-9e3c6180-fef5-11e8-970b-077bed46cdee.png
   
   ### Solution
   Add the `spark.sql.dataset.show.handleFullWidthCharacters` property to 
control whether `Dataset#show` accounts for wide characters when padding cells.
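
A minimal sketch of how such a flag could switch the padding rule. `stringWidth` follows the helper in the PR diff; `renderColumn`, the reduced full-width regex, and the object name are assumptions made for illustration, not Spark's rendering code.

```scala
object ShowPaddingSketch {
  // Assumed, simplified full-width ranges; not Spark's exact regex.
  private val fullWidthRegex =
    "[\\u1100-\\u115F\\u2E80-\\uA4CF\\uAC00-\\uD7A3\\uFF00-\\uFF60]".r

  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size

  // Same logic as the stringWidth helper proposed in the PR diff.
  def stringWidth(str: String, halfWidth: Boolean): Int =
    if (str == null) 0 else if (halfWidth) stringHalfWidth(str) else str.length

  // Hypothetical renderer: right-aligns every cell of one column to the widest entry.
  def renderColumn(cells: Seq[String], handleFullWidth: Boolean): Seq[String] = {
    val width = cells.map(stringWidth(_, handleFullWidth)).foldLeft(0)((a, b) => math.max(a, b))
    cells.map { c =>
      val padding = " " * (width - stringWidth(c, handleFullWidth))
      s"|$padding$c|"
    }
  }

  def main(args: Array[String]): Unit = {
    // Flag off: every rendered row has the same String length.
    renderColumn(Seq("name", "かな"), handleFullWidth = false).foreach(println)
    // Flag on: rows align on a terminal but differ in String length.
    renderColumn(Seq("name", "かな"), handleFullWidth = true).foreach(println)
  }
}
```

With the flag off, rows keep equal lengths and stay machine-parseable by offset; with it on, they align visually on a terminal instead.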
   
   ## How was this patch tested?
   This patch was tested via unit tests.
   
   ## Jira Issue
   https://issues.apache.org/jira/browse/SPARK-26335


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

