Hi Paul,

For our case, being able to specify the schema in any way would probably solve it, whether as a cast or condition in the query, or as some sort of data dictionary I can configure within Drill. We do conform to a schema in our Mongo collections nowadays.
If the types are set in the query, I'd advocate for it being a new construct rather than overloading CAST - CAST is intended to convert data rather than make any assertion about the type of the data (other than that it is possible to convert to the target type). Putting it in the WHERE clause does at least fit into an existing feature (filter pushdown) a little bit. Of course, the most logical place for this information (in a query) would be in the FROM clause, like

SELECT * FROM mongo.db.answers (_id: STRING, points: DOUBLE)

But that would require an extension to the ANSI syntax.

Declaring this ahead of time does seem interesting, but you would have to specify all the files that a schema applies to, maybe using some kind of file path wildcard or some such. Seems doable, but also complex.

We don't allow our end users to write any SQL queries; I'm just trying to get this to the point where it is friendly enough for me, an experienced programmer, to use.

On 2/26/2020 11:23:33 AM, Paul Rogers <[email protected]> wrote:

Hi Dobes,

I had a similar thought: perhaps we can implement "type push-down": use the CAST to implicitly declare a provided schema for a column. In your query:

SELECT _id, CAST(points AS DOUBLE)
FROM mongo.formative.answers AS answer
WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
LIMIT 100

A normal SQL planner runs type propagation from column types to expressions to result columns. For "type push-down", we would run the planner's type propagation "backwards" to say that `points` must be a DOUBLE and `current_timestamp` must be a DATE, and thus so must `answer.createdAt`. These types would then be pushed into the scan so that if the scan sees NULLs for any of these columns, it knows the type of the NULL. In this case, we have no information about the type of _id, which is OK because _id sounds like something that would never be NULL.

This does not help if `points` is read as an INT: now every reader would have to be prepared for any type conversion the user dreams up. This is, in fact, the problem I've been fighting in a current PR about type conversions. We end up not only with CAST implementations, but a separate implementation of the CAST logic in each reader. I'm actually pretty skeptical that the approach is much of a win.

While this might be a clever hack, and can probably be implemented, it pushes all the complexity onto your user. If your target market is primary or secondary education, you may have a hard time convincing users or app developers to include the needed boilerplate. That is, while the above would work, the following would not:

SELECT _id, points ...

Further, your users are likely not going to write SQL. (In my years working with my local schools, I never found a single person who could write SQL, usually not even the IT staff.) Instead, the users will use some front end: your hosted UI, a custom web UI, a BI tool, etc. Those tools won't know to insert the cast. You could require the web UI people to do so, but not a BI tool vendor.

Each of these audiences would observe, rightly, that in the SQL world, the right place for type information is in the data dictionary, so that UI engineers, BI tool vendors, and ad-hoc query writers don't all have to keep the DD in their heads and write the correct CAST statements. At least, that's been my (limited) experience. Is there something unique or clever about your app where pushing type info into the query would be practical?
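To put the two cases side by side (a sketch of the proposed behavior, not something Drill does today):

-- With type push-down, the CAST would tell the scan up front that
-- `points` is a DOUBLE, even when the first rows it reads are all NULL:
SELECT _id, CAST(points AS DOUBLE) AS points
FROM mongo.formative.answers;

-- The bare query gives the scan no type information at all, so it is
-- back to guessing (today's guess: "nullable INT"):
SELECT _id, points
FROM mongo.formative.answers;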
We've been chipping away at a generalized data dictionary ("metastore") so that you can connect Drill to whatever type registry you might have. Not quite ready for prime time, but close.

Drill has another trick: if we did do "type push-down", you could write a view on top of each table that includes all the casts. This is described in the Drill docs, on the mailing list, and in our Drill book. Problem is, now instead of maintaining a DD, you maintain a set of views. I've yet to be convinced that this is an improvement.

Would views + type push-down (+ maybe hints) give you what you want without having some form of data dictionary?

Thanks,
- Paul

On Wednesday, February 26, 2020, 8:06:20 AM PST, Dobes Vandermeer wrote:

Shower thought: if you wrote types into the WHERE clause as filters, it would both declare the type and give permission to skip anything that doesn't match that type. This filter can even be pushed down to the Mongo query using the $type operator. It would still have to support nullable values, though. Do you think this idea has potential?

On Feb 25, 2020, 11:50 PM -0800, Paul Rogers wrote:

> Hi Dobes,
>
> Looking at the BSON spec [1], it seems that BSON works like JSON: a null
> isn't a null of some type, it is just null. So, if Drill sees, say, 10
> nulls, and has to create a vector to store them, it doesn't know which type
> to use. In fact, if BSON worked any other way, there could not be a 1:1
> translation between the two.
>
> This contrasts with the SQL model, where a NULL is a NULL INT or a NULL
> VARCHAR, never just a NULL. Makes sense: all values in a SQL column must be
> of the same type, so NULLs are of that column type.
>
> And here we have the fundamental conflict between SQL and JSON. If I have a
> field "c" in three JSON objects, JSON does not care if those three fields
> have the same type or wildly differing types (null, integer, and object,
> say). SQL, however, gets its convenience from types being implicit and
> constant: the data dictionary gives the types so the user does not have to.
>
> Drill has a UNION type (called a "variant" in some DBs) that can hold, say,
> a NULL, a FLOAT8, and an object (MAP). Perfect! But how do you sum, sort,
> or group by this monster? How do you send the results to JDBC or ODBC? How
> do you chart a UNION?
>
> Trying to get SQL to act like JSON (or vice versa) has been an ongoing
> conflict for the life of the Drill project (and, for me, with a BI tool
> before that).
>
> The best we can do is to say that Drill works on the subset of JSON which
> represents a tabular structure (same types on fields of the same name
> across objects). That is, Drill works if the JSON was created by the
> equivalent of a SQL query (with nested structure to "un-join" tables). But
> we still need to deal with untyped nulls.
>
> The "sample the first batch" approach only works if we can guarantee that a
> non-null value appears in the first batch, always. I'd not bet on it. A
> query of one row ("Johnny's result from the test on Tuesday," when Johnny
> was absent and so "points" is null) will confuse a JDBC client programmed
> to expect a (possibly null) double.
>
> What we need is a way to know that "when column `points` does appear, it
> will be a FLOAT8." This is what the "provided schema" does: you can specify
> just the problematic columns. For files, we put the schema file in the
> table directory. This is very lightweight: when a conflict occurs, add a
> hint to make the problem go away.
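>
> For a file-based table, the hint looks roughly like this (this is the
> CREATE SCHEMA command added in Drill 1.16; the table name here is just for
> illustration):
>
> CREATE OR REPLACE SCHEMA (points DOUBLE)
> FOR TABLE dfs.tmp.`answers`;
>
> Only the problematic `points` column is pinned down; all other columns are
> still discovered at read time.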
>
> But, there is no directory in Mongo where we can stash a schema file. So,
> what we need is some other place to store that annotation. Maybe something
> associated with the storage plugin for something like Mongo: "for this
> Mongo server, `points` is FLOAT8." Or something fancier if, say, students,
> teachers, assignments, and tests are all just objects (with some type
> field) in a single Mongo "table."
>
> Thinking a bit more broadly, I'd guess that the data comes from somewhere,
> perhaps a web form. If so, then the UI designer certainly knew what type
> she was gathering. If it is a DB extract, the original source had a type.
> Scantron? Has a type. PowerSchool/SchoolWires/Moodle/whatever likely store
> data in a DB, so it has a type. If you ETL to Parquet for storage in S3,
> Parquet needs a type. So, somebody knows the type. We just have to let
> Drill in on the secret.
>
> Would some kind of hint-based model work for your use case? Or is there
> some idea that would be better?
>
> Thanks,
> - Paul
>
> [1] http://bsonspec.org/spec.html
>
> On Tuesday, February 25, 2020, 10:17:56 PM PST, Dobes Vandermeer wrote:
>
> Hi Paul,
>
> It seems to me that the type of the columns in a JSON file is "JSON" - e.g.
> map, array, number, string, null, or boolean. In MongoDB it is "BSON",
> which adds dates, integers, and a few other things.
>
> Lacking further guidance from the user, I would expect Drill to handle all
> JSON & BSON columns as if they could hold any of those types at any time.
> It definitely should not distinguish between integers and floats in JSON,
> because JSON does not have this distinction.
>
> I suppose this may seem like a pain, though; perhaps it blows up the
> algorithms Drill uses. I'm still new to Drill, so I don't really understand
> all the implications of it. But I do know that this *is* the true data
> model of JSON & BSON. Trying to lock down the schema will create impedance
> mismatches. Unless this reality is accepted, the pain will never end, I
> suspect.
>
> If the user does a CAST() on some values, then the output of the CAST
> operation can be assumed to be the specified type, or there will be an
> error. Perhaps there's some hope in that direction.
>
> On 2/25/2020 8:05:37 PM, Paul Rogers wrote:
>
> Hi Dobes,
>
> You've run into the classic drawback of runtime schema inference: if Drill
> never sees a column value in its first sample, then it has no way to
> "predict the future" and guess what type will eventually show up. So, Drill
> guesses "nullable INT," which turns out to almost always be wrong.
>
> Some record readers pick the type on the very first row (a sample size of
> 1). The newer JSON reader we're working on uses the first batch (a few
> thousand rows) as its sample size.
>
> Still, if you request "points", the reader is obligated to provide a column
> even if it has to make something up. So, it makes up "nullable INT."
>
> This is the "black swan" problem of inductive reasoning: no matter how many
> empty values Drill sees, there could always be a non-empty value of some
> other type.
>
> Worse, one scan may see no value and choose "nullable INT" while another
> sees the actual value and chooses FLOAT8. Now, some poor exchange receiver
> operator will see both types and have no clue what to do.
>
> This is why most DBs require a metastore (AKA data dictionary) to provide
> table descriptions. Instead of inferring types, DBs define the types, often
> via the same spec that drives the generative process that created the data.
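>
> (Short of a full metastore, a view can act as a poor man's data dictionary:
> bake the casts into a view once, and point every consumer at the view. A
> sketch, with an invented view name:
>
> CREATE OR REPLACE VIEW dfs.tmp.`answers_typed` AS
> SELECT _id, CAST(points AS DOUBLE) AS points
> FROM mongo.formative.answers;
>
> Consumers then see `points` as a DOUBLE without writing the CAST
> themselves, though this is still subject to the scan-level limits described
> below.)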
>
> Drill also has a relatively new "provided schema" feature that helps with
> this issue in some (but not all) format plugins. But it has not yet been
> added to Mongo (or any other storage plugin other than the file system
> plugin.)
>
> You could try a conditional cast: something like
>
> CASE WHEN sqlTypeOf(points) = 'INTEGER'
>      THEN CAST(NULL AS DOUBLE)
>      ELSE points
> END
>
> (I probably have the syntax a bit wrong.) This works if two different scans
> see the different types. But it will fail if a single scan sees an empty
> value followed by a real (non-null) value (which is exactly the case you
> describe), because the scan is trying to cope with the data before it's
> even gotten to the Project operator where the conditional would be applied.
>
> Sorry for the long post, but this is a difficult issue that has frustrated
> users for years. I recently posted a proposed solution design at [1] and
> would welcome feedback.
>
> Thanks,
> - Paul
>
> [1] https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model
>
> On Tuesday, February 25, 2020, 5:27:01 PM PST, Dobes Vandermeer wrote:
>
> Hi,
>
> I was experimenting with the Mongo storage plugin and I found that when I
> query a field that doesn't usually have any value, I get this error: "You
> tried to write a Float8 type when you are using a ValueWriter of type
> NullableIntWriterImpl."
>
> Based on a bit of googling, I found that this means Drill has inferred the
> incorrect type for that field. I was hoping I could override the inferred
> type using CAST or something, but CAST didn't work. Is there a way to tell
> Drill what type a field from MongoDB is supposed to be?
>
> Example query:
>
> SELECT _id, CAST(points AS DOUBLE)
> FROM mongo.formative.answers AS answer
> WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
> LIMIT 100
>
> In this case "points" isn't set on every row, so I guess Drill assumes it
> is "NullableInt" when really it should be considered a double. We also have
> many boolean fields that are not set by default that we would want to
> query.
>
> What's the standard workaround for this case?
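A handy diagnostic for this situation: Drill ships type-introspection functions (typeOf since early releases; sqlTypeOf and drillTypeOf arrived around Drill 1.14) that report what type a scan actually chose for a column, so the "nullable INT" guess can be seen directly. A sketch against the same collection:

-- typeOf reports the value's type (it can report NULL for null values);
-- drillTypeOf reports the column's declared internal type regardless.
SELECT typeOf(points) AS value_type,
       drillTypeOf(points) AS column_type,
       points
FROM mongo.formative.answers
LIMIT 10;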
