[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663730#comment-16663730 ]

Oron Navon commented on SPARK-25807:

OK, thanks in any case - I'm closing the issue.

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
> Issue Type: Improvement
> Components: Java API, PySpark
> Affects Versions: 1.3.0, 2.3.2, 2.4.0, 3.0.0
> Reporter: Oron Navon
> Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's
> {{substr}}, which are zero-based. Both PySpark users and Java API users
> often naturally expect a 0-based {{substr()}}. Adding to the confusion,
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the
> same result as {{startPos==1}}.
>
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option
> here, I suggest making one or more of the following changes:
> # Adding a method {{substr0}}, which would be zero-based
> # Renaming {{substr}} to {{substr1}}
> # Making the existing {{substr()}} throw an exception on {{startPos==0}},
> which should catch and alert most users who expect zero-based behavior.
>
> This is my first discussion on this project, apologies for any faux pas.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663648#comment-16663648 ]

Sean Owen commented on SPARK-25807:

The language here is SQL(-like), though, not Python or Java. Column.substr() has to match how SQL substr() works: the method exists because it exists in SQL, and it needs to match SQL's semantics, including the 1-based position. These methods exist to match SQL, not as a library of useful functions designed from first principles. For most anything else you want to do, write code or UDFs, as that's easier. But in this case, I can't see value in redundant substr functions. The existence of alternatives doesn't even mitigate the issue you describe.

The pos semantics are again taken from SQL, and lots of things are clunky about SQL. This argument can be negative to define a position from the end of the string. I don't see behavior for 0 defined anywhere, but treating it as an error, as always yielding the empty string, or just as a synonym for "1" all seem plausible. Hive does the latter, and maybe other SQL engines do too, but matching Hive alone is a good enough reason for this behavior.
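To make the position semantics from the comment above concrete, here is a minimal pure-Python sketch of them: positive positions are 1-based, 0 is treated as a synonym for 1, and negative positions count back from the end of the string. The function name `sql_substr` is hypothetical, and the clamping of out-of-range negative positions is an assumption for illustration; this is not Spark's or Hive's actual implementation.

```python
def sql_substr(s, pos, length):
    """Sketch of the SQL-style substr semantics discussed in this thread."""
    if length <= 0:
        return ""
    if pos > 0:
        start = pos - 1                # 1-based position -> 0-based index
    elif pos < 0:
        # Negative positions count from the end of the string.
        # Clamping to 0 is an assumption made for this sketch.
        start = max(len(s) + pos, 0)
    else:
        start = 0                      # 0 behaves like 1, as in Hive
    return s[start:start + length]


print(sql_substr("kevin", 1, 2))   # ke
print(sql_substr("kevin", 0, 2))   # ke (same result as pos == 1)
print(sql_substr("kevin", -3, 2))  # vi
print("kevin"[1:3])                # ev (Python slicing is 0-based)
```

The second and fourth calls show the confusion the issue describes: a caller who passes 0 or 1 expecting Python-style indexing silently gets a different substring than intended.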
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663560#comment-16663560 ]

Hyukjin Kwon commented on SPARK-25807:

I think it's not an issue if that's clearly documented.
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663378#comment-16663378 ]

Oron Navon commented on SPARK-25807:

Thanks guys. [~srowen], fair enough about matching Hive/SQL behavior, but note that since users code in Python/Java/Scala (where substr behavior is zero-based), this becomes unintuitive and can easily lead to misuse of the API. An explicit {{substr0}} and {{substr1}} would be unambiguous, but I agree it's distasteful. I would appreciate it if you have any other ideas.

Specifically about allowing {{substr(0, ...)}}, what's the motivation for that? Since its behavior is identical to {{substr(1, ...)}}, a call to {{substr(0, ...)}} almost certainly indicates that the user expects 0-based behavior. Shouldn't we throw an exception in this case? It would catch most such situations.
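As an illustration of the "throw on 0" idea raised above, here is a minimal pure-Python sketch. The name `strict_substr` and the error message are hypothetical; nothing like this exists in Spark, and the thread's outcome was to keep the current behavior.

```python
def strict_substr(s, start_pos, length):
    """1-based substring that rejects start_pos == 0 instead of treating it as 1."""
    if start_pos == 0:
        raise ValueError(
            "start_pos is 1-based; got 0 (did you expect 0-based indexing?)"
        )
    # Positive positions are 1-based; negative ones count back from the end.
    start = start_pos - 1 if start_pos > 0 else max(len(s) + start_pos, 0)
    return s[start:start + max(length, 0)]


print(strict_substr("kevin", 1, 2))  # ke
try:
    strict_substr("kevin", 0, 2)
except ValueError as err:
    print("rejected:", err)
```

The trade-off is the one discussed in the thread: failing fast surfaces the off-by-one assumption, but it diverges from Hive, which accepts 0.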
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663332#comment-16663332 ]

kevin yu commented on SPARK-25807:

[~oron.navon]: You can also try to implement a simple Python UDF to do the 0-based substr(). Here is an example:

    from pyspark.sql.functions import UserDefinedFunction

    def substr0(x, y, z):
        # x: input string, y: 0-based start position, z: length
        if y < 0:
            y = 0
        if z < 0:
            z = 0
        return x[y:z+y]

    # Register the UDF so that it can be called from SQL
    sub_str0 = spark.catalog.registerFunction("subStr0", UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

    spark.sql("select substr0('kevin', 0, 2)").collect()
    [Row(subStr0(kevin, 0, 2)=u'ke')]

    spark.sql("select substr0('kevin', 1, 2)").collect()
    [Row(subStr0(kevin, 1, 2)=u'ev')]
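An alternative to the Python UDF above, sketched under stated assumptions: because `Column.substr(startPos, length)` is 1-based, a 0-based helper can shift the start by one and delegate to it, which keeps the work in a built-in column expression rather than a Python UDF. The helper name `substr0` is hypothetical (it is the name proposed in the issue, not an existing API), and `FakeColumn` is only a stand-in so the sketch runs without a Spark session.

```python
def substr0(col, start, length):
    """0-based wrapper: shift start by one and delegate to the 1-based col.substr()."""
    if start < 0:
        raise ValueError("start must be >= 0 for the 0-based helper")
    return col.substr(start + 1, length)


class FakeColumn:
    """Stand-in for a Spark Column, only so this sketch runs without Spark."""

    def __init__(self, value):
        self.value = value

    def substr(self, startPos, length):
        # Mimics the 1-based contract for startPos >= 1 only.
        return self.value[startPos - 1:startPos - 1 + length]


print(substr0(FakeColumn("kevin"), 0, 2))  # ke
print(substr0(FakeColumn("kevin"), 1, 2))  # ev
```

With a real DataFrame the same helper would be called as, for example, `df.select(substr0(df.name, 0, 2))`, where `df` and its `name` column are assumed for illustration.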
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662830#comment-16662830 ]

kevin yu commented on SPARK-25807:

Thanks Sean. OK, I will leave it as it is.
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662817#comment-16662817 ]

Sean Owen commented on SPARK-25807:

They are meant to match Hive, SQL. They should not match Java, Python. No, this should not be changed.
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661324#comment-16661324 ]

kevin yu commented on SPARK-25807:

I am looking into option 1. Option 3 would change existing behavior, so it probably requires more discussion.

Kevin