Yep, this is the behavior for insertInto; the other write APIs do schema 
matching, I believe.

> On Mar 3, 2021, at 8:29 AM, Sean Owen <[email protected]> wrote:
> 
> I don't have any good answer here, but I seem to recall that this is because 
> of SQL semantics, which follow column ordering, not naming, when performing 
> operations like this. It may well be as intended.
> 
> On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <[email protected]> wrote:
> Hi,
> 
> I have encountered a weird and potentially dangerous behaviour of Spark 
> concerning partial overwrites of partitioned data. I am not sure whether 
> this is a bug or just an abstraction leak. I have checked the Spark section 
> of Stack Overflow and haven't found any relevant questions or answers.
> 
> Full minimal working example provided as attachment. Tested on Databricks 
> runtime 7.3 LTS
> ML (Spark 3.0.1). Short summary:
> 
> Write a dataframe partitioned by a column using saveAsTable. Filter out 
> part of the dataframe, change some values (simulating a new increment of 
> data) and write it again, overwriting a subset of partitions using 
> insertInto. This operation will either fail on a schema mismatch or corrupt 
> the data.
> 
> Reason: on the first write, the ordering of the columns is changed (the 
> partition column is placed at the end). On the second write this is not 
> taken into account and Spark inserts values into the columns based on their 
> order, not their name. If the columns have different types the write will 
> fail; if not, values will be written to the wrong columns, corrupting the 
> data.
> 
> My question: is this a bug or intended behaviour? Can anything be done to 
> prevent it? The issue can be avoided by doing a select with the schema 
> loaded from the target table. However, a user who is not aware of this 
> could face hard-to-track-down errors in the data.
> 
> Best regards,
> Oldřich Vlašic
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
