Why not perform a df.select(...) before the final write to ensure a
consistent ordering?
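
For example, a minimal PySpark sketch (untested; "my_db.my_table" and df
are placeholders for the real table and dataframe):

    # Align the dataframe's column order with the target table's schema
    # before the position-based insertInto.
    target_cols = spark.table("my_db.my_table").columns
    df.select(target_cols).write.insertInto("my_db.my_table", overwrite=True)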

On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic <[email protected]>
wrote:

> Thanks for the reply! Is there something to be done, for example setting a
> config property? I'd like to prevent users (mainly data scientists) from
> falling victim to this.
> ------------------------------
> *From:* Russell Spitzer <[email protected]>
> *Sent:* Wednesday, March 3, 2021 3:31 PM
> *To:* Sean Owen <[email protected]>
> *Cc:* Oldrich Vlasic <[email protected]>; user <
> [email protected]>; Ondřej Havlíček <[email protected]>
> *Subject:* Re: [Spark SQL, intermediate+] possible bug or weird behavior
> of insertInto
>
> Yep, this is the behavior for insertInto; the other write APIs do schema
> matching by name, I believe.
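>
> For instance (an untested sketch; the table name is a placeholder):
>
>     # insertInto ignores column names and resolves columns by position:
>     df.write.insertInto("my_db.my_table", overwrite=True)
>     # saveAsTable in append mode resolves columns by name instead:
>     df.write.mode("append").saveAsTable("my_db.my_table")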
>
> On Mar 3, 2021, at 8:29 AM, Sean Owen <[email protected]> wrote:
>
> I don't have a good answer here, but I seem to recall that this is
> because of SQL semantics, which follow column ordering, not naming, when
> performing operations like this. It may well be as intended.
>
> On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <
> [email protected]> wrote:
>
> Hi,
>
> I have encountered a weird and potentially dangerous behaviour of Spark
> concerning partial overwrites of partitioned data. I'm not sure if this is
> a bug or just an abstraction leak. I have checked the Spark section of
> Stack Overflow and haven't found any relevant questions or answers.
>
> A full minimal working example is provided as an attachment. Tested on
> Databricks runtime 7.3 LTS ML (Spark 3.0.1). Short summary:
>
> Write a dataframe partitioned by a column using saveAsTable. Filter out
> part of the dataframe, change some values (simulating a new increment of
> data) and write it again, overwriting a subset of partitions using
> insertInto. This operation will either fail on a schema mismatch or cause
> data corruption.
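>
> A condensed sketch of the scenario (illustrative only, not the attached
> example; table name and values are made up):
>
>     from pyspark.sql import functions as F
>
>     # Overwrite only the partitions present in the new data
>     spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
>
>     df = spark.createDataFrame([("p1", "a"), ("p2", "b")], ["part", "value"])
>
>     # First write: the partition column is moved to the end, so the
>     # table schema becomes (value, part)
>     df.write.partitionBy("part").saveAsTable("demo_table")
>
>     # Simulate a new increment: one partition with a changed value
>     increment = df.where(F.col("part") == "p2").withColumn("value", F.lit("b2"))
>
>     # Second write: insertInto matches columns by position, not name.
>     # `increment` is still ordered (part, value), so partition values land
>     # in the 'value' column and vice versa -- silent data corruption.
>     increment.write.insertInto("demo_table", overwrite=True)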
>
> Reason: on the first write, the ordering of the columns is changed (the
> partition column is placed at the end). On the second write this is not
> taken into consideration and Spark tries to insert values into the columns
> based on their order, not their name. If the types differ, this will fail;
> if not, values will be written to the wrong columns, causing data
> corruption.
>
> My question: is this a bug or intended behaviour? Can something be done
> to prevent it? The issue can be avoided by doing a select with the schema
> loaded from the target table. However, when the user is not aware of this,
> it can cause hard-to-track-down errors in the data.
>
> Best regards,
> Oldřich Vlašic
>
