Thank you for the feedback, Wenchen, Maxim, and Takeshi.

    1. "we would also promote the SQL standard TRIM syntax"
    2. "we could output a warning with a notice about the order of
parameters"
    3. "it looks nice to make these (two parameters) trim functions
unrecommended in future releases"

Yes, in case of reverting SPARK-28093, we had better deprecate these
two-parameter SQL function invocations, because this form is really
esoteric and even inconsistent inside Spark (the Scala API also follows
the PostgreSQL/Presto style, like the following).

    def trim(e: Column, trimString: String)
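To make the trap concrete, here is a small illustration outside Spark (a
Python sketch: Python's `str.strip` happens to take the trim character set
as its argument in the PostgreSQL/Presto order; the function names below
are hypothetical, chosen only to contrast the two orders):

```python
# Illustrative sketch, not Spark code.
# PostgreSQL/Presto-style trim: first argument is the source string,
# second is the set of characters to strip.
def pg_style_trim(src_str: str, trim_str: str) -> str:
    return src_str.strip(trim_str)

# Spark 2.3 ~ 2.4-style two-parameter trim swapped the arguments, so the
# same call site silently trims the wrong string.
def spark24_style_trim(trim_str: str, src_str: str) -> str:
    return src_str.strip(trim_str)

print(pg_style_trim('yxTomxx', 'xyz'))       # Tom
print(spark24_style_trim('yxTomxx', 'xyz'))  # z
```

This reproduces exactly the discrepancy shown in the spark-sql example
below: the same call text yields `Tom` in PostgreSQL/Presto but `z` in
Spark 2.3 ~ 2.4.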

If we keep this situation in the 3.0.0 release (a major release), Apache
Spark will carry this behavior forever. And it becomes worse and worse
because it leads more users to fall into this trap.

There are a few ways to deprecate. I believe (3) can be a proper next step
in case of reverting, because (2) is infeasible and (1) is considered
insufficient since it is silent, just as SPARK-28093 was. We need a
non-silent (noisy) one in this case. Technically, (3) can be done in
`Analyzer.ResolveFunctions`.

    1. Documentation-only: remove the examples and add a migration guide
    2. Compile-time warning by annotation: this is not an option for SQL
functions inside SQL strings.
    3. Runtime warning with a directional guide:
       log.warn("... USE TRIM(trimStr FROM str) INSTEAD")
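As a rough sketch of what (3) could look like (plain Python for
illustration only; the real change would live in Spark's
`Analyzer.ResolveFunctions`, and the resolver name and message text here
are only assumptions, not Spark APIs):

```python
import warnings

# Hypothetical resolver sketch: accept the fixed trim(srcStr, trimStr)
# order, but emit a noisy runtime warning that points users at the SQL
# standard TRIM(trimStr FROM str) syntax.
def resolve_trim(args):
    if len(args) == 2:
        warnings.warn(
            "Two-parameter TRIM is deprecated; "
            "USE TRIM(trimStr FROM str) INSTEAD",
            DeprecationWarning,
        )
        src_str, trim_str = args
        return src_str.strip(trim_str)
    (src_str,) = args
    return src_str.strip()
```

The one-parameter form stays untouched, so only users of the esoteric
two-parameter invocation would see the warning.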

What do you think about (3)?

Bests,
Dongjoon.

On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro <linguin....@gmail.com>
wrote:

> The revert looks fine to me for keeping the compatibility.
> Also, I think the different orders between the systems easily lead to
> mistakes, so, as Wenchen suggested, it looks nice to make these
> (two parameters) trim functions unrecommended in future releases:
> https://github.com/apache/spark/pull/27540#discussion_r377682518
> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>
> Bests,
> Takeshi
>
>
> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk <maxim.g...@databricks.com>
> wrote:
>
>> Also, if we look at the possible combinations of trim parameters:
>> 1. foldable srcStr + foldable trimStr
>> 2. foldable srcStr + non-foldable trimStr
>> 3. non-foldable srcStr + foldable trimStr
>> 4. non-foldable srcStr + non-foldable trimStr
>>
>> Case #2 seems rare, and #3 is probably the most common. Once we see the
>> second case, we could output a warning with a notice about the order of
>> parameters.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> It's unfortunate that we don't have a clear document to talk about
>>> breaking changes (I'm working on it BTW). I believe the general guidance
>>> is: *avoid breaking changes unless we have to*. For example, the
>>> previous result was so broken that we had to fix it, moving to Scala 2.12
>>> made us break some APIs, etc.
>>>
>>> For this particular case, do we have to switch the parameter order? It's
>>> different from some systems, and the order was not decided explicitly, but
>>> I don't think these are strong reasons. People from RDBMSs should use the
>>> SQL standard TRIM syntax more often. People using prior Spark versions
>>> should have figured out the parameter order of Spark TRIM (there was no
>>> documentation) and adjusted their queries. There is no standard that
>>> defines the parameter order of the TRIM function.
>>>
>>> In the long term, we would also promote the SQL standard TRIM syntax. I
>>> don't see much benefit in "fixing" the parameter order that is worth a
>>> breaking change.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>>
>>>
>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>> and the TRIM(trimStr FROM str) syntax.
>>>>
>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>
>>>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'm sending this email because the Apache Spark committers had better
>>>>> have a consistent point of view for the upcoming PRs. And, a community
>>>>> policy is the way to lead the community members transparently and clearly
>>>>> for the long-term good.
>>>>>
>>>>> First of all, I want to emphasize that, like Apache Spark 2.0.0,
>>>>> Apache Spark 3.0.0 is going to achieve many improvements in the SQL area.
>>>>>
>>>>> Especially, we invested lots of efforts to improve SQL ANSI compliance
>>>>> and related test coverage for the future. One simple example is the
>>>>> following.
>>>>>
>>>>>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>>>>>
>>>>> With this support, we are able to move away from the esoteric Spark
>>>>> functions `TRIM/LTRIM/RTRIM`.
>>>>> (Note that the above syntax is the ANSI standard, but we also have these
>>>>> additional non-standard functions.)
>>>>>
>>>>> Here is the summary of the current status.
>>>>>
>>>>>     +------------------------+-------------+---------------+
>>>>>     | SQL Processing Engine  | TRIM Syntax | TRIM Function |
>>>>>     +------------------------------------------------------+
>>>>>     | Spark 3.0.0-preview/2  |      O      |       O       |
>>>>>     | PostgreSQL             |      O      |       O       |
>>>>>     | Presto                 |      O      |       O       |
>>>>>     | MySQL                  |      O(*)   |       X       |
>>>>>     | Oracle                 |      O      |       X       |
>>>>>     | MsSQL                  |      O      |       X       |
>>>>>     | Teradata               |      O      |       X       |
>>>>>     | Hive                   |      X      |       X       |
>>>>>     | Spark 1.6 ~ 2.2        |      X      |       X       |
>>>>>     | Spark 2.3 ~ 2.4        |      X      |       O(**)   |
>>>>>     +------------------------+-------------+---------------+
>>>>>     (*) MySQL doesn't allow multiple trim characters
>>>>>     (**) Spark 2.3 ~ 2.4 have the function with the parameters reversed.
>>>>>
>>>>> Here is the illustrative example of the problem.
>>>>>
>>>>>     postgres=# SELECT trim('yxTomxx', 'xyz');
>>>>>     btrim
>>>>>     -------
>>>>>     Tom
>>>>>
>>>>>     presto:default> SELECT trim('yxTomxx', 'xyz');
>>>>>     _col0
>>>>>     -------
>>>>>     Tom
>>>>>
>>>>>     spark-sql> SELECT trim('yxTomxx', 'xyz');
>>>>>     z
>>>>>
>>>>> Here is our history to fix the above issue.
>>>>>
>>>>>     [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order
>>>>> issue
>>>>>     1. https://github.com/apache/spark/pull/24902
>>>>>        (Merged 2019-06-18 for v3.0.0, 3.0.0-preview and 3.0.0-preview2
>>>>> released.)
>>>>>     2. https://github.com/apache/spark/pull/24907
>>>>>        (Merged 2019-06-20 for v2.3.4, but reverted)
>>>>>     3. https://github.com/apache/spark/pull/24908
>>>>>        (Merged 2019-06-21 for v2.4.4, but reverted)
>>>>>
>>>>> (2) and (3) were reverted before releases because we didn't want to
>>>>> fix that in the maintenance releases. Please see the following references
>>>>> of the decision.
>>>>>
>>>>>     https://github.com/apache/spark/pull/24908#issuecomment-504799028
>>>>> (2.3)
>>>>>     https://github.com/apache/spark/pull/24907#issuecomment-504799021
>>>>> (2.4)
>>>>>
>>>>> Now, there are some requests to revert SPARK-28093 and to keep these
>>>>> esoteric functions for backward compatibility, for the following reasons.
>>>>>
>>>>>     > Reordering function parameters to match another system,
>>>>>     > for a method that is otherwise working correctly,
>>>>>     > sounds exactly like a cosmetic change to me.
>>>>>
>>>>>     > How can we silently change the parameter of an existing SQL
>>>>> function?
>>>>>     > I don't think this is a correctness issue as the SQL standard
>>>>>     > doesn't say that the function signature has to be trim(srcStr,
>>>>> trimStr).
>>>>>
>>>>> The concerns and points of view make sense.
>>>>>
>>>>> My concerns are the following.
>>>>>
>>>>>     1. These kinds of esoteric differences are negatively called
>>>>> `vendor lock-in`.
>>>>>        - It's difficult for new users to understand.
>>>>>        - It's hard to migrate between Apache Spark and the others.
>>>>>     2. Although we did our best, Apache Spark SQL has not always been
>>>>> good enough.
>>>>>     3. We need to keep improving in future releases.
>>>>>
>>>>> In short, we can either keep the 3.0.0-preview behavior here or revert
>>>>> SPARK-28093 in order to keep this vendor lock-in behavior for
>>>>> backward compatibility.
>>>>>
>>>>> What I want is to build a consistent point of view in this category
>>>>> for the upcoming PR reviews.
>>>>>
>>>>> If we have a clear policy, we can save future community efforts in
>>>>> many ways.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>
>
> --
> ---
> Takeshi Yamamuro
>
