Hi Dobes,
Looking at the BSON spec [1], it seems that BSON works like JSON: a null isn't
a null of some type, it is just null. So, if Drill sees, say, 10 nulls, and has
to create a vector to store them, it doesn't know which type to use. In fact,
if BSON worked any other way, there could not be a 1:1 translation between the
two.
This contrasts with the SQL model where a NULL is a NULL INT or a NULL VARCHAR;
never just a NULL. Makes sense: all values in a SQL column must be of the same
type so NULLs are of that column type.
And here we have the fundamental conflict between SQL and JSON. If I have a
field "c" in three JSON objects, JSON does not care if those three fields have
the same type or wildly differing types (null, Integer, object, say.) SQL,
however, gets its convenience from types being implicit and constant: the data
dictionary gives the types so the user does not have to.
Drill has a UNION type (called a "Variant" in some DBs) that can hold, say, a
NULL a FLOAT8 and an object (MAP). Perfect! But, how do you sum, sort or group
by this monster? How do you send the results to JDBC or ODBC? How to you chart
a UNION?
Trying to get SQL to act like JSON (or visa-versa) has been an ongoing conflict
for the life of the Drill project (and, for me, with a BI tool before that.)
The best we can do is to say that Drill works on the subset of JSON which
represents a tabular structure (same types on fields of the same name across
objects.) That is, Drill works if the JSON was created by the equivalent of a
SQL query (with nested structure to "un-join" tables.) But, we still need to
deal with untyped nulls.
The "sample the first batch" approach only works if we can guarantee a non-null
value appears in the first batch, always. I'd not bet on it. A query of one row
("Johnny's result from the test on Tuesday" when Johnny was absent and so
"points" is null) will confuse the JDBC client programmed to expect a (possibly
null) double.
What we need is a way to know that "when column `points` does appear, it will
be a FLOAT8". This is what the "provided schema" does: you can specify just the
problematic columns. For files, we put the schema file in the table directory.
This is very light-weight: when a conflict occurs, add a hint to make the
problem go away.
But, there is no directory in Mongo where we can stash a schema file. So, what
we need is some other place to store that annotation. Maybe something
associated with the storage plugin for something like Mongo: "for this Mongo
server, `points` is FLOAT8." Or something fancier if, say, students, teachers,
assignments and tests are all just objects (with some type field) in a single
Mongo "table."
Thinking a bit more broadly, I'd guess that the data comes from somewhere,
perhaps a web form. If so, then the UI designer certainly knew what type she
was gathering. If it is a DB extract, the original source had a type. Scantron?
Has a type. PowerSchool/SchoolWires/Moodle/Whatever likely store data in a DB,
so it has a type. If you ETL to Parquet for storage in S3, Parquet needs a
type. So, somebody knows the type. We just have to let Drill in on the secret.
Would some kind of hint-based model work for your use case? Some idea that
would be better?
Thanks,
- Paul
[1] http://bsonspec.org/spec.html
On Tuesday, February 25, 2020, 10:17:56 PM PST, Dobes Vandermeer
<[email protected]> wrote:
Hi Paul,
It seems to me that the type of the columns in a JSON file is "JSON" - e.g.
map, array, number, string, null, or boolean. In mongodb it is "BSON", which
adds dates, integers, and a few other things.
Lacking further guidance from the user, I would expect drill to handle all JSON
& BSON columns as if they could hold any of those types at any time. It
definitely should not distinguish between integers and floats in JSON, because
JSON does not have this distinction.
I suppose this may seem like a pain, though; perhaps it blows up the algorithms
drill uses. I'm still new to drill so I don't really understand all the
implications of it. But I do know that this *is* the true data model of JSON &
BSON. Trying to lockdown the schema will create impedance mismatches.
Unless this reality is accepted then the pain will never end, I suspect.
If the user does a CAST() on some values then the output of the CAST operation
can be assumed to be specified type, or there will be an error. Perhaps
there's some hope in that direction.
On 2/25/2020 8:05:37 PM, Paul Rogers <[email protected]> wrote:
Hi Dobes,
You've run into the classic drawback of runtime schema inference: if Drill
never sees a column value in its first sample, then it has no way to "predict
the future" and guess what type will eventually show up. So, Drill guesses
"nullable INT" which turns out to almost always be wrong.
Some record readers pick the type on the very first row (a sample size of 1.)
The newer JSON reader we're working on uses the first batch (a few thousand
rows) as its sample size.
Still, if you request "points", the reader is obligated to provide a column
even if has to make something up. So, it makes up "nullable INT."
This is the "black swan" problem of inductive reasoning: no matter how many
empty values Drill sees, there could always be a non-empty value of some other
type.
Worse, one scan may see no value and choose "nullable INT" while another sees
the actual value and chooses Float8. Now, some poor exchange receiver operator
will see both types and have no clue what to do.
This is why most DBs require a metastore (AKA data dictionary) to provide table
descriptions. Instead of infering types, DBs define the types, often via the
same spec that drives the generative process that created the data.
Drill also has relatively new "provided schema" feature that helps with this
issue in some (but not all) format plugins. But, it has not yet been added to
Mongo (or any other storage plugin other than the file system plugin.)
You could try a conditional cast: something like
IF(sqlTypeOf(points) = `INT`, CAST(NULL AS FLOAT4), points)
(I probably have the syntax a bit wrong.) This works if two different scans see
the different types. But, it will fail if a single scan sees an empty value
followed by a null value (which is exactly the case you describe) because the
scan is trying to cope with the data before its even gotten to the Project
operator where the IF would be applied.
Sorry for the long post, but this is a difficult issue that has frustrated
users for years. I recently posted a proposed solution design at [1] and would
welcome feedback.
Thanks,
- Paul
[1]
https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model
On Tuesday, February 25, 2020, 5:27:01 PM PST, Dobes Vandermeer wrote:
Hi,
I was experimenting with the mongo storage system and I found that when I query
a field that doesn't usually have any value, I get this error "You tried to
write a Float8 type when you are using a ValueWriter of type
NullableIntWriterImpl."
Based on a bit of googling I found that this means drill has inferred the
incorrect type for that field. I was hoping I could override the inferred type
using CAST or something, but CAST didn't work. Is there a way to tell drill
what type a field from mongodb is supposed to be?
Example query:
SELECT _id, CAST(points AS DOUBLE)
FROM mongo.formative.answers AS answer
WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
LIMIT 100
In this case "points" isn't set on every row, so I guess drill assumes it is
"NullableInt" when really it is should be considered a double. We also have
many boolean fields that are not set by default that we would want to query.
What's the standard workaround for this case?