I see what you mean when you say a schema only gets you so far. Your Fred
Flintstone example shows how you almost need the ability to apply a
transformation at the reader level (instead of at the projection level) to
properly read such data files.
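
As a rough sketch of what a reader-level transformation could look like,
here is a small Python function. It is not the Fred Flintstone example
itself. It assumes a hypothetical case where an already-decoded list mixes
text and binary values, and where the binary values are really UTF-8 text.
The function name and its parameters are made up for illustration.

```python
def coerce_list_to_text(values, encoding="utf-8", errors="strict"):
    """Return a copy of `values` with every bytes-like element decoded to str.

    Assumption: any binary element is really text in `encoding`.
    With errors="strict", undecodable bytes raise UnicodeDecodeError
    rather than being silently altered.
    """
    result = []
    for value in values:
        if isinstance(value, (bytes, bytearray)):
            # Reader-level fix: decode before the value reaches the caller.
            result.append(bytes(value).decode(encoding, errors))
        else:
            # Non-binary elements pass through unchanged.
            result.append(value)
    return result
```

Doing this in the reader means every consumer sees a list of one type. The
cost is that the encoding guess is baked in before any query can override it.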
I think I agree with Charles Givre. I've always liked the tag
Hi JC,
Bingo, you just hit the core problem with schema-on-read: there is no "right"
rule for how to handle ambiguous or inconsistent schemas. Take your
string/binary example. You determined that the binary fields were actually
strings (encoded in what, UTF-8? ASCII? Host's native codeset?)
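
To make the encoding question concrete, here is a minimal Python sketch. The
sample bytes and the helper name are assumptions for illustration, not taken
from the dataset under discussion. It decodes the same binary value under
several candidate encodings and shows that each gives a different answer.

```python
def decode_attempts(raw, encodings):
    """Decode `raw` under each encoding; None marks a failed decode."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Hypothetical sample: the UTF-8 bytes for "café".
sample = b"caf\xc3\xa9"
attempts = decode_attempts(sample, ["utf-8", "latin-1", "ascii"])
# utf-8   -> 'café'
# latin-1 -> 'cafÃ©'  (decodes without error, but the text is wrong)
# ascii   -> None     (bytes above 0x7F are rejected)
```

The latin-1 case is the troublesome one: the decode succeeds, so the reader
cannot tell from the bytes alone that it guessed wrong.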
I'm writing a msgpack reader and have encountered datasets where an array
contains different types, for example a VARCHAR and a BINARY. It turns out
the BINARY is actually a string. I know this is probably just not modeled
correctly in the first place, but I'm still going to modify the reading of
list