Currently Drill tries to infer schemas from data that doesn't come with one, 
such as JSON, CSV, and MongoDB.  However, this doesn't work well if the first N 
rows are missing values for some fields: Drill assigns an arbitrary type to 
fields that are only ever null, and no type at all to fields that are missing 
completely, then rejects values when it encounters them later.

What if you could instead query in a mode where each row is returned as a 
single string, and use JSON functions to pull the fields out and convert or 
cast them to the appropriate types?

For JSON in particular it's common these days to provide functions that extract 
data from a JSON string column.  BigQuery and postgres are two good examples.
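Here's a rough sketch of the idea using SQLite's JSON1 functions in place of Drill (the table name and JSON shape are made up for illustration); the point is that late-arriving or initially-null fields cause no trouble when extraction and casting happen per query rather than via inferred schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_docs (doc TEXT)")
conn.executemany(
    "INSERT INTO raw_docs VALUES (?)",
    [
        ('{"name": "a"}',),               # "age" missing entirely
        ('{"name": "b", "age": null}',),  # "age" present but null
        ('{"name": "c", "age": "42"}',),  # "age" arrives later, as a string
    ],
)

# Each row is just a string; the query pulls fields out and casts them,
# so no up-front type inference is needed.
rows = conn.execute(
    """
    SELECT json_extract(doc, '$.name') AS name,
           CAST(json_extract(doc, '$.age') AS INTEGER) AS age
    FROM raw_docs
    """
).fetchall()
print(rows)  # [('a', None), ('b', None), ('c', 42)]
```

Missing and null fields both come back as SQL NULL, and the string "42" is cast cleanly once it shows up.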

I think in many cases these JSON functions could be inspected by a driver and 
still be used for filter pushdown.
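To make the pushdown claim concrete, here's a hypothetical sketch of how a driver might recognize a comparison against a json_extract path in a WHERE clause and translate it into a MongoDB-style filter document (the expression tuple format and the operator mapping are invented for illustration):

```python
# Map SQL comparison operators to their MongoDB query-operator equivalents.
OPS = {"=": "$eq", ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}

def push_down(expr):
    """Translate (op, json_path, literal) into a Mongo filter, or None.

    expr is a tuple like (">", "$.age", 30), where "$.age" came from a
    json_extract(doc, '$.age') call in the WHERE clause.
    """
    op, path, literal = expr
    if op not in OPS or not path.startswith("$."):
        return None  # not pushable; evaluate in the query engine instead
    field = path[2:]  # strip the JSON-path root "$."
    return {field: {OPS[op]: literal}}

print(push_down((">", "$.age", 30)))  # {'age': {'$gt': 30}}
print(push_down(("~", "$.age", 30)))  # None (unsupported operator)
```

Anything the driver can't translate just falls back to evaluation in the engine, so pushdown stays a best-effort optimization.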

Anyway, this is just an idea I had for approaching the Mongo schema problem 
that's a bit different from trying to specify the schema up front.  I think 
this approach offers more flexibility to the user, at the cost of more verbose 
syntax and queries that are harder to optimize.
