[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread Oron Navon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663730#comment-16663730
 ] 

Oron Navon commented on SPARK-25807:


OK, thanks in any case - I'm closing the issue.

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's string slicing and Java's 
> {{String.substring()}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.
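
The off-by-one described in the report can be illustrated with a small pure-Python sketch. The helper `substr_1_based` is a hypothetical toy model of the behaviour stated above (1-based start, with 0 treated like 1); it is not Spark's implementation and only covers non-negative start positions.

```python
def substr_1_based(s, start_pos, length):
    """Toy model of a 1-based substr (assumption: non-negative start_pos only)."""
    # Per the report, a start position of 0 behaves the same as 1.
    start = max(start_pos, 1) - 1
    return s[start:start + length]


word = "kevin"

# Python slicing is 0-based: start index 0, two characters.
zero_based = word[0:0 + 2]

# Passing the same 0 to the 1-based model gives the same answer by accident ...
same_by_accident = substr_1_based(word, 0, 2)

# ... but any other 0-based index is silently shifted by one character.
python_slice = word[2:2 + 2]
shifted = substr_1_based(word, 2, 2)

print(zero_based, same_by_accident)  # ke ke
print(python_slice, shifted)         # vi ev
```

Because the 0 case happens to agree, code written with 0-based expectations can pass a quick check at the start of a string and still be wrong for every other position.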



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663648#comment-16663648
 ] 

Sean Owen commented on SPARK-25807:
---

The language here is SQL(-like), though, not Python or Java. 
Column.substr() exists because substr() exists in SQL, and it has to match 
SQL's semantics, including the 1-based position. These methods exist to match 
SQL, not as a library of useful functions designed from first principles. For 
most anything else you want to do, write your own code or UDFs, as that's 
easier. In this case, though, I can't see value in redundant substr functions. 
The existence of alternatives doesn't even mitigate the issue you describe.

The pos semantics are again taken from SQL, and lots of things are clunky about 
SQL. This argument can be negative to define a position from the end of the 
string. I don't see behavior for 0 defined anywhere, but treating it as an 
error, as always yielding the empty string, or just as a synonym for "1" all 
seem plausible. Hive does the last of these, and maybe other SQL engines do 
too, but matching Hive alone is a good enough reason for this behavior.
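
As a rough illustration of the pos semantics discussed in this comment, the following pure-Python sketch models a 1-based position, a negative position counted from the end of the string, and the three candidate behaviours for 0. The function name `sql_like_substr` and the `zero_policy` parameter are hypothetical, and the handling of a negative position that reaches past the start of the string is an assumption of this sketch rather than something stated in the thread.

```python
def sql_like_substr(s, pos, length, zero_policy="as_one"):
    """Toy model of SQL-style substr positions (not Spark's or Hive's code)."""
    if pos == 0:
        if zero_policy == "error":
            raise ValueError("pos must not be 0: positions are 1-based")
        if zero_policy == "empty":
            return ""
        if zero_policy != "as_one":
            raise ValueError("unknown zero_policy: %r" % (zero_policy,))
        pos = 1  # the behaviour attributed to Hive in the comment above

    # Positive positions count from the start (1-based);
    # negative positions count back from the end of the string.
    start = pos - 1 if pos > 0 else len(s) + pos
    end = start + length

    # Assumption: anything that falls before the start of the string is cut off.
    return s[max(start, 0):max(end, 0)]


print(sql_like_substr("kevin", 1, 2))                        # ke
print(sql_like_substr("kevin", -3, 2))                       # vi
print(sql_like_substr("kevin", 0, 2))                        # ke
print(repr(sql_like_substr("kevin", 0, 2, zero_policy="empty")))  # ''
```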




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663560#comment-16663560
 ] 

Hyukjin Kwon commented on SPARK-25807:
--

I think it's not an issue if that's clearly documented.




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread Oron Navon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663378#comment-16663378
 ] 

Oron Navon commented on SPARK-25807:


Thanks guys. [~srowen], fair enough about matching Hive/SQL behavior, but note 
that since users code in Python/Java/Scala (where substring behavior is 
zero-based), this becomes unintuitive and can easily lead to misuse of the API. 
An explicit {{substr0}} and {{substr1}} would be unambiguous, but I agree it's 
distasteful. I would appreciate any other ideas you have.

Specifically about allowing {{substr(0, ...)}}, what's the motivation for that? 
Since its behavior is identical to {{substr(1, ...)}}, a call to {{substr(0, 
...)}} almost certainly indicates that the user expects 0-based behavior. 
Shouldn't we throw an exception in this case? It would catch most such situations.
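
Without any change to Spark itself, the check proposed here could be approximated in user code. The sketch below is only an illustration: `strict_substr` is a hypothetical helper, and it assumes its first argument exposes a `substr(startPos, length)` method, as PySpark's `Column` does. It rejects a literal start position of 0 and forwards everything else unchanged.

```python
def strict_substr(col, start_pos, length):
    """Hypothetical guard: refuse a literal start position of 0.

    `col` is assumed to be any object with a substr(startPos, length)
    method, such as a PySpark Column.
    """
    # Only plain integers are checked; a start position given as a column
    # expression cannot be inspected here and is passed through as-is.
    if isinstance(start_pos, int) and start_pos == 0:
        raise ValueError(
            "substr() positions are 1-based; use 1 for the first character"
        )
    return col.substr(start_pos, length)
```

With a DataFrame `df` that has a string column `name` (both hypothetical), `strict_substr(df.name, 1, 2)` would behave like `df.name.substr(1, 2)`, while `strict_substr(df.name, 0, 2)` would raise a `ValueError` before any job runs.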




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663332#comment-16663332
 ] 

kevin yu commented on SPARK-25807:
--

[~oron.navon]: You can also try implementing a simple Python UDF to do a 
0-based substr(). Here is an example:

`from pyspark.sql.functions import UserDefinedFunction

def substr0(x, y, z):
    # 0-based substring: y is the start index, z is the length.
    # Negative values are clamped to 0.
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:y + z]

# `spark` is the SparkSession provided by the pyspark shell.
# Register the function so it can be called from SQL as subStr0.
sub_str0 = spark.catalog.registerFunction("subStr0", UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()

[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()

[Row(subStr0(kevin, 1, 2)=u'ev')]`

 




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662830#comment-16662830
 ] 

kevin yu commented on SPARK-25807:
--

Thanks Sean. OK, I will leave it as it is. 




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662817#comment-16662817
 ] 

Sean Owen commented on SPARK-25807:
---

These methods are meant to match Hive and SQL, not Java or Python. No, this 
should not be changed.




[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-23 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661324#comment-16661324
 ] 

kevin yu commented on SPARK-25807:
--

I am looking into option 1. Option 3 would change existing behavior, so it 
probably requires more discussion.

Kevin
