Thanks Reynold and Cheng. It does seem like quite a bit of heavy lifting to
have a schema per row. For now, I will settle for building a union schema of
all the schema versions and complaining about any incompatibilities :-)
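A minimal sketch of what that union could look like, assuming the
org.apache.spark.sql.types API of Spark 1.3+; SchemaUnion and unionSchemas
are hypothetical names, not part of Spark:

import org.apache.spark.sql.types.{StructField, StructType}

object SchemaUnion {
  // Fold all known schema versions into one union schema, failing
  // loudly when two versions disagree on a column's type.
  def unionSchemas(versions: Seq[StructType]): StructType =
    versions.reduce { (left, right) =>
      val leftByName = left.fields.map(f => f.name -> f).toMap
      // Complain about incompatibilities: same name, different type.
      for (f <- right.fields; existing <- leftByName.get(f.name))
        require(existing.dataType == f.dataType,
          s"Incompatible types for column '${f.name}': " +
            s"${existing.dataType} vs ${f.dataType}")
      val extras = right.fields.filterNot(f => leftByName.contains(f.name))
      // A column missing from some versions must be nullable in the union.
      StructType((left.fields ++ extras).map(_.copy(nullable = true)))
    }
}

With per-version schemas in hand, unionSchemas(Seq(v1Schema, v2Schema))
yields one table schema, and the require call surfaces exactly the type
conflicts discussed below.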
Looking forward to doing great things with the API!
Thanks,
Aniket
On Thu, Jan 29, 2015:
I saw the talk on Spark data sources and, looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources, but I have a situation in which I would need to integrate a
system that supports schema evolution by allowing users to change the
schema without
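To make the "schema needs to be provided upfront" point concrete, here is a
minimal sketch of a relation under the data sources API (import paths per
Spark 1.3+); EvolvingSourceRelation and its columns are made up for
illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

class EvolvingSourceRelation(override val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Spark SQL asks for the schema once, before scanning; every Row
  // returned by buildScan() is expected to conform to it.
  override def schema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("payload", StringType, nullable = true)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1L, "v1")))
}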
Hi Aniket,

In general, the schema of all rows in a single table must be the same; this
is a basic assumption made by Spark SQL. Schema union does make sense, and
we're planning to support this for Parquet. But as you've mentioned, it
doesn't help if the types of different versions of a column differ.
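For later readers of this thread: Parquet schema merging did ship in a
later release. A minimal sketch against the Spark 1.4+ DataFrameReader API
(the /data/events paths are hypothetical); note it still fails when the
same column carries incompatible types across versions:

import org.apache.spark.sql.{DataFrame, SQLContext}

// Read two generations of the same dataset, letting Spark union the
// column sets found in the Parquet footers.
def readMerged(sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .option("mergeSchema", "true")
    .parquet("/data/events/v1", "/data/events/v2")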
It's an interesting idea, but there are major challenges with per-row
schema.

1. Performance - the query optimizer and execution engine rely on
assumptions about schema and data to generate optimized query plans. Having
to re-reason about the schema for each row can substantially slow down the
engine, but due to