Sure. I'll try to do a pull request this week.

Schema evolution is always painful for database people. IMO, NULL was a bad
design decision in the original System R. It introduces a lot of problems
during system migration and data integration.

Let me describe a possible scenario: an RDBMS is used as an ODS, and Spark is
used as an external online data analysis engine. The results are stored in
Parquet files and inserted back into the RDBMS at regular intervals. In this
case, we face a few options (a concrete sketch of the round trip follows the
list):

- Change the data types of the columns in the RDBMS tables to allow nullable
values. The logic of the RDBMS applications that consume these results must
then also handle NULL. When the applications are third-party, changing them
becomes harder.

- As you suggested, before loading the data from the Parquet files, add an
extra step that performs any necessary data cleaning, value transformation,
or exception reporting when a NULL is found.

With such an external parameter, when writing the data schema to an external
data store, Spark would do its best to keep the original schema unchanged
(e.g., preserve the initial definition of nullability). If some data
type/schema conversions are unavoidable, it would issue warnings or errors
to the users. Does that make sense?
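To make the semantics concrete, the write path might look roughly like this.
Both option names below are purely hypothetical; nothing like them exists in
Spark, and this is only a sketch of the proposal:

// Hypothetical API, for discussion only; neither option exists in Spark.
// Reuses the fromRdbms DataFrame from the sketch above.
fromRdbms.write
  .option("preserveNullability", "true") // hypothetical: keep NOT NULL columns as required fields in Parquet
  .option("onSchemaChange", "error")     // hypothetical: report unavoidable conversions instead of relaxing silently
  .parquet("/tmp/metrics.parquet")

With something like preserveNullability enabled, a NOT NULL INT column would
be written as a required int32 field in the Parquet schema rather than an
optional one, and any unavoidable conversion would surface as a warning or
error rather than a silent schema change.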

Thanks,

Xiao Li

2015-10-20 12:38 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> First, this is not documented in the official documentation. Maybe we
>> should add it? http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>
> Pull requests welcome.
>
>
>> Second, nullability is a significant concept for database people. It is
>> part of the schema. Extra code is needed to evaluate whether a value is
>> null for every nullable data type. Thus, it might cause a problem if you
>> need to use Spark to transfer data between Parquet and an RDBMS. My
>> suggestion is to introduce another external parameter?
>>
>
> Sure, but a traditional RDBMS has the opportunity to do validation before
> loading data in. That's not really an option when you are reading random
> files from S3. This is why Hive and many other systems in this space treat
> all columns as nullable.
>
> What would the semantics of this proposed external parameter be?
>
