Yep this is the behavior for Insert Into, using the other write apis does schema matching I believe.
> On Mar 3, 2021, at 8:29 AM, Sean Owen <[email protected]> wrote: > > I don't have any good answer here, but, I seem to recall that this is because > of SQL semantics, which follows column ordering not naming when performing > operations like this. It may well be as intended. > > On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <[email protected] > <mailto:[email protected]>> wrote: > Hi, > > I have encountered a weird and potentially dangerous behaviour of Spark > concerning > partial overwrites of partitioned data. Not sure if this is a bug or just > abstraction > leak. I have checked Spark section of Stack Overflow and haven't found any > relevant > questions or answers. > > Full minimal working example provided as attachment. Tested on Databricks > runtime 7.3 LTS > ML (Spark 3.0.1). Short summary: > > Write dataframe using partitioning by a column using saveAsTable. Filter out > part of the > dataframe, change some values (simulates new increment of data) and write > again, > overwriting a subset of partitions using insertInto. This operation will > either fail on > schema mismatch or cause data corruption. > > Reason: on the first write, the ordering of the columns is changed (partition > column is > placed at the end). On the second write this is not taken into consideration > and Spark > tries to insert values into the columns based on their order and not on their > name. If > they have different types this will fail. If not, values will be written to > incorrect > columns causing data corruption. > > My question: is this a bug or intended behaviour? Can something be done about > it to prevent > it? This issue can be avoided by doing a select with schema loaded from the > target table. > However, when user is not aware this could cause hard to track down errors in > data. > > Best regards, > Oldřich Vlašic > > --------------------------------------------------------------------- > To unsubscribe e-mail: [email protected] > <mailto:[email protected]>
