Why not perform a df.select(...) before the final write to ensure a consistent column ordering?
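
Something along these lines should do it (untested sketch; the table name "events" and its columns are placeholders, not from the original example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # New increment of data; its columns may be in a different order than the table.
    increment = spark.createDataFrame(
        [("2021-03-02", 2, 20.0)], ["day", "id", "value"]
    )

    # Reorder the increment's columns to match the target table's schema
    # (saveAsTable moves partition columns to the end), since insertInto
    # matches columns by position rather than by name.
    target_cols = spark.table("events").columns
    increment.select(*target_cols).write.insertInto("events", overwrite=True)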
On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic <[email protected]> wrote:

> Thanks for the reply! Is there something to be done, setting a config property for example? I'd like to prevent users (mainly data scientists) from falling victim to this.
>
> ------------------------------
> From: Russell Spitzer <[email protected]>
> Sent: Wednesday, March 3, 2021 3:31 PM
> To: Sean Owen <[email protected]>
> Cc: Oldrich Vlasic <[email protected]>; user <[email protected]>; Ondřej Havlíček <[email protected]>
> Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto
>
> Yep, this is the behavior for insertInto; the other write APIs do schema matching, I believe.
>
> On Mar 3, 2021, at 8:29 AM, Sean Owen <[email protected]> wrote:
>
> I don't have a good answer here, but I seem to recall that this is because of SQL semantics, which follow column ordering rather than naming when performing operations like this. It may well be as intended.
>
> On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <[email protected]> wrote:
>
> Hi,
>
> I have encountered a weird and potentially dangerous behaviour of Spark concerning partial overwrites of partitioned data. I'm not sure whether this is a bug or just an abstraction leak. I have checked the Spark section of Stack Overflow and haven't found any relevant questions or answers.
>
> A full minimal working example is provided as an attachment. Tested on Databricks Runtime 7.3 LTS ML (Spark 3.0.1). Short summary:
>
> Write a dataframe partitioned by a column using saveAsTable. Filter out part of the dataframe, change some values (simulating a new increment of data) and write it again, overwriting a subset of partitions using insertInto. This operation will either fail on a schema mismatch or cause data corruption.
>
> Reason: on the first write, the ordering of the columns is changed (the partition column is placed at the end). On the second write this is not taken into consideration and Spark tries to insert values into the columns based on their order, not their name. If the columns have different types the write will fail; if not, values will be written to the wrong columns, causing data corruption.
>
> My question: is this a bug or intended behaviour? Can something be done to prevent it? The issue can be avoided by doing a select with the schema loaded from the target table, but a user who isn't aware of it could face hard-to-track-down errors in their data.
>
> Best regards,
> Oldřich Vlašic
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
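
For reference, a rough sketch of the scenario described in the original message (untested, and not the original attachment; the "events" table and its columns are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    # Overwrite only the partitions present in the increment, not the whole table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame(
        [("2021-03-01", 1, 10.0), ("2021-03-02", 2, 20.0)],
        ["day", "id", "value"],
    )

    # First write: saveAsTable moves the partition column "day" to the end,
    # so the stored table schema becomes (id, value, day).
    df.write.partitionBy("day").saveAsTable("events")

    # The new increment keeps the original column order (day, id, value).
    increment = (
        df.where(col("day") == "2021-03-02")
          .withColumn("value", col("value") * 2)
    )

    # insertInto matches columns by position, not by name: "day" goes into
    # "id", "id" into "value" and "value" into "day", so the insert either
    # fails on a type mismatch or silently writes values into the wrong columns.
    increment.write.insertInto("events", overwrite=True)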
