Hi Dobes,
I like your idea! I think it would be a great addition to Drill in general and
will show the way for other storage plugins.
Technically, you are right, there should not be that much work. We have working
examples in other plugins of what would be needed. Perhaps the biggest cost is
just to get familiar with the storage plugin lifecycle, which is not that
complex, but has a number of moving parts.
Some general thoughts. For the schema, you can use the
TupleMetadata/ColumnMetadata classes, which are how we handle the provided
schema (and many other tasks). Pretty simple, and there are many examples.
The first step is to pass the schema from the planner to the execution engine
(reader). Basically, in the GroupScan/SubScan, obtain your schema from wherever
makes sense. Extend the SubScan with a field of type TupleMetadata and set it
to the schema. That will automatically get serialized to JSON in the execution
plan and sent to each executor. Then, use this schema so the reader knows what
type of column to create.
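Very roughly, and with invented names (the real class is whatever the Mongo
SubScan is called), the change amounts to something like:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.drill.exec.record.metadata.TupleMetadata;

// Sketch only: carry the schema on the SubScan so Jackson serializes it into
// the execution plan and each executor can hand it to its reader.
public class MongoSubScanSketch {
  private final TupleMetadata schema;

  @JsonCreator
  public MongoSubScanSketch(@JsonProperty("schema") TupleMetadata schema) {
    this.schema = schema;   // may be null if no schema was provided
  }

  @JsonProperty("schema")
  public TupleMetadata getSchema() { return schema; }
}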
The next step is to actually use the schema when reading data. We have a newer
framework called EVF ("extended vector framework") which handles much of the
boilerplate including creating a set of vectors and "column writers" given a
schema. Mongo, however, uses an older method, but the same ideas should apply.
For JSON, we are working on a newer, EVF-based JSON parser that will support a
provided schema.
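To give a flavor of the EVF side (just a sketch; package names have moved
around between Drill versions), the reader ends up writing rows through column
writers like this:

import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class EvfWriteSketch {
  // EVF creates the vectors from the schema; the reader only pushes values.
  void writeRow(RowSetLoader rowWriter, double points) {
    rowWriter.start();
    rowWriter.scalar("points").setDouble(points); // writer type comes from the schema
    rowWriter.save();
  }
}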
Mongo uses BSON. Looks like it uses a BsonRecordReader class. The way that
class is structured, it might be a bit tricky to insert the type conversion
layer. There is a trick we've used elsewhere to convert the in-line case
statement by type into a set of classes, one per column type, which can do the
conversion. Once we've made that change, the column "adapter" can use the
schema to write the right type.
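The shape of that change is roughly this (names invented for illustration;
these classes don't exist yet):

import org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter;
import org.bson.BsonValue;

// Hypothetical per-column "adapter": one small class per target type replaces
// the big switch-on-BSON-type inside BsonRecordReader.
abstract class ColumnAdapter {
  protected final String name;
  ColumnAdapter(String name) { this.name = name; }
  abstract void write(MapWriter writer, BsonValue value);
}

class DoubleColumnAdapter extends ColumnAdapter {
  DoubleColumnAdapter(String name) { super(name); }
  @Override
  void write(MapWriter writer, BsonValue value) {
    if (!value.isNull()) {
      // Whatever BSON holds (int32, int64, double), write it as the FLOAT8
      // column declared in the provided schema.
      writer.float8(name).writeFloat8(value.asNumber().doubleValue());
    }
  }
}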
The fancy solution would be to update the Mongo plugin to use EVF and rework
BsonRecordReader to follow the pattern of the EVF-based JSON parser. The more
pragmatic solution is to just wrap the existing reader with a bit more code
to add the schema; we can do a full upgrade later.
We also have a work-in-progress filter push-down framework which can improve
Mongo's ability to push down what it can and leave the rest to Drill. That way,
if you can push things like a date range or a student ID into Mongo, then the
more advanced predicates can stay in Drill.
Let me know where we can help with suggestions, answering questions, pointers
to examples, tackling some of the trickier bits, or whatever.
Thanks,
- Paul
On Thursday, February 27, 2020, 6:21:46 PM PST, Dobes Vandermeer
<[email protected]> wrote:
Hi Paul,
After looking at the mongo stuff a bit more today I realized that probably the
simplest solution would be to put some kind of schema mapping into the storage
plugin configuration. Some subset of JSON schema using the drill config syntax.
What do you think of this idea? Might actually be something that can be
implemented pretty quickly.
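Something like this, maybe (purely hypothetical names, nothing that exists in
Drill today):

import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.Map;

// Hypothetical shape of the idea: a column -> SQL type map kept in the Mongo
// storage plugin configuration and consulted when building the schema.
public class MongoSchemaMappingSketch {
  @JsonProperty("columnTypes")
  public Map<String, String> columnTypes;  // e.g. "answers.points" -> "DOUBLE"
}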
On Feb 26, 2020, 1:36 PM -0800, Dobes Vandermeer <[email protected]>, wrote:
> Hi Paul,
>
> You can always model a type union as a struct with a nullable field for each
> variant. So JSON would be a struct with nullable boolean, double, object,
> array fields. This would work for JDBC as well. As you have pointed out,
> this does have negative performance implications.
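> As a sketch of what I mean (hypothetical, just to illustrate the shape):
>
> import java.util.List;
> import java.util.Map;
>
> // A JSON value modeled as a struct with one nullable field per variant;
> // exactly one of these is non-null for any given value.
> public class JsonVariantSketch {
>   public Boolean bool;
>   public Double number;
>   public String text;
>   public Map<String, JsonVariantSketch> object;
>   public List<JsonVariantSketch> array;
> }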
>
> Operators would have to be updated to support the JSON data type, it's true,
> and there's some complexity in terms of type promotions, the sorting of
> heterogeneous types, and so forth.
>
> Our data isn't really unstructured, but we typically omit fields if they hold
> a default value, to save space. And we have fields that might be null, an
> int, or a double. We would need Drill to support this scenario without relying
> on a double or int specifically showing up in some sampling of records in
> order to determine the data type of that column.
>
> We aren't allowing our clients to write any SQL. This is just for our
> purposes trying to run reports in a scalable manner. MongoDB doesn't scale
> horizontally or dynamically, and writing reports in MongoDB is quite painful.
> SQL is much better for this.
>
> If SQL isn't suitable for use with MongoDB/JSON that's fine in general, but
> for me at least I was thinking of using Apache Drill to allow SQL against
> MongoDB, JSON, and Parquet data together. Otherwise there are probably
> more mature products that just target purely relational data or Parquet files.
>
>
>
> > On 2/26/2020 11:44:09 AM, Paul Rogers <[email protected]> wrote:
> > Hi Dobes,
> >
> > You have a very interesting use case. Let's explore a bit further.
> >
> > You are right that it should be possible to build a query engine on top of
> > Union/Variant (JSON object) types. MapR has done so with MapRDB. Sounds
> > like Mongo has also (I'll make a note to read up on Mongo.) Drill has
> > implemented some of these ideas in bits and pieces. However, we've hit
> > several problems due to the SQL and columnar nature of Drill.
> >
> >
> > First, SQL is not designed for this use case: every operator and function
> > has to be rethought. (For example, a sort has seen only INT values and has
> > been using an INT comparison. Now, the column becomes a DOUBLE, so we need
> > to add a DOUBLE comparison, DOUBLE/INT comparison, and logic to combine the
> > two. Vectors that were pure INT in previous sorted runs now have to change
> > to become a UNION of either INT or DOUBLE so we can merge runs of INTs with
> > runs of DOUBLEs.)
> >
> > Second, UNIONs are very inefficient with the extra storage, type checks and
> > so on. So, they need to be used only where needed, or else we violate the #1
> > concern for any query engine: Performance. So, code complexity vastly
> > increases.
> >
> > Third, neither JDBC nor ODBC understand UNION (Variant). BI tools don't
> > understand UNION. Yet, these are Drill's primary clients. So, somebody has
> > to convert the UNION into something that xDBC can understand. Should this
> > be client code (which, I think, can't even be done for ODBC) or should
> > Drill do it? And, if we've only ever seen untyped nulls, we could not
> > convert to the correct type, even in principle, because we don't know it
> > (unless we enforce the CAST push-down from the other discussion.)
> >
> >
> > If Drill does the UNION-to-type conversion, why not do it early rather than
> > late to avoid all the complexity?
> >
> > Sounds like you may live in a world where it is JSON in, JSON out,
> > JSON-aware clients and you need a JSON-aware query engine. That query
> > engine cannot be based on (standard) SQL. Might be based on SQL++ or a
> > JSON-extended SQL which does not try to work with xDBC.
> >
> > Likely, elsewhere in your project, there is someone else (maybe it is you)
> > trying to convert runs of untyped JSON nulls into a typed Parquet file that
> > matches the case where those JSON fields have a type. Now you'd need to
> > convince Parquet, and all its consumers (Amazon Athena, say) to support
> > ambiguous types and unions. What about your ML group that now has to have
> > feature vectors of mixed types? Right about now a little voice might be
> > saying, "ah, maybe we're on the wrong track."
> >
> >
> > One could argue that the education market is looking for a simple solution:
> > one that uses standard (SQL-based) tools. Rather than having to reinvent
> > the entire stack to work around a flaw in JSON encoding (untyped nulls), it
> > might be better to fix the JSON bug so you can leverage the rich ecosystem
> > of SQL-based tools.
> >
> > Can you explain a bit more the source and use of the data? Will either the
> > source or the user expect typed data? How might you handle type ambiguity
> > in parts of your app other than query?
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Wednesday, February 26, 2020, 8:14:56 AM PST, Dobes Vandermeer wrote:
> >
> > Hi Paul,
> >
> > You can, of course, represent a JSON value as a union of the various types,
> > and your questions are already well answered in mongo and elsewhere.
> >
> > Typically numeric operations (sum) will fail if values are not all numbers.
> > Numbers are promoted to double if they are not all double.
> >
> > Group by can just compare each field separately and its presence or
> > absence, no problem there.
> >
> > Conversion to jdbc can use whatever system is already used for objects. Any
> > variant not populated is null. If they want a more concise representation
> > then CAST will do a best effort conversion or fail.
> >
> > For a chart you would CAST the data points to double.
> >
> > I don't personally see any major roadblock there.
> > On Feb 25, 2020, 11:50 PM -0800, Paul Rogers, wrote:
> > > Hi Dobes,
> > >
> > > Looking at the BSON spec [1], it seems that BSON works like JSON: a null
> > > isn't a null of some type, it is just null. So, if Drill sees, say, 10
> > > nulls, and has to create a vector to store them, it doesn't know which
> > > type to use. In fact, if BSON worked any other way, there could not be a
> > > 1:1 translation between the two.
> > >
> > >
> > > This contrasts with the SQL model where a NULL is a NULL INT or a NULL
> > > VARCHAR; never just a NULL. Makes sense: all values in a SQL column must
> > > be of the same type so NULLs are of that column type.
> > >
> > > And here we have the fundamental conflict between SQL and JSON. If I have
> > > a field "c" in three JSON objects, JSON does not care if those three
> > > fields have the same type or wildly differing types (null, Integer,
> > > object, say.) SQL, however, gets its convenience from types being
> > > implicit and constant: the data dictionary gives the types so the user
> > > does not have to.
> > >
> > > Drill has a UNION type (called a "Variant" in some DBs) that can hold,
> > > say, a NULL, a FLOAT8, and an object (MAP). Perfect! But, how do you sum,
> > > sort or group by this monster? How do you send the results to JDBC or
> > > ODBC? How do you chart a UNION?
> > >
> > >
> > > Trying to get SQL to act like JSON (or vice versa) has been an ongoing
> > > conflict for the life of the Drill project (and, for me, with a BI tool
> > > before that.)
> > >
> > >
> > > The best we can do is to say that Drill works on the subset of JSON which
> > > represents a tabular structure (same types on fields of the same name
> > > across objects.) That is, Drill works if the JSON was created by the
> > > equivalent of a SQL query (with nested structure to "un-join" tables.)
> > > But, we still need to deal with untyped nulls.
> > >
> > > The "sample the first batch" approach only works if we can guarantee a
> > > non-null value appears in the first batch, always. I'd not bet on it. A
> > > query of one row ("Johnny's result from the test on Tuesday" when Johnny
> > > was absent and so "points" is null) will confuse the JDBC client
> > > programmed to expect a (possibly null) double.
> > >
> > >
> > > What we need is a way to know that "when column `points` does appear, it
> > > will be a FLOAT8". This is what the "provided schema" does: you can
> > > specify just the problematic columns. For files, we put the schema file
> > > in the table directory. This is very light-weight: when a conflict
> > > occurs, add a hint to make the problem go away.
> > >
> > > But, there is no directory in Mongo where we can stash a schema file. So,
> > > what we need is some other place to store that annotation. Maybe
> > > something associated with the storage plugin for something like Mongo:
> > > "for this Mongo server, `points` is FLOAT8." Or something fancier if,
> > > say, students, teachers, assignments and tests are all just objects (with
> > > some type field) in a single Mongo "table."
> > >
> > >
> > > Thinking a bit more broadly, I'd guess that the data comes from
> > > somewhere, perhaps a web form. If so, then the UI designer certainly knew
> > > what type she was gathering. If it is a DB extract, the original source
> > > had a type. Scantron? Has a type. PowerSchool/SchoolWires/Moodle/Whatever
> > > likely store data in a DB, so it has a type. If you ETL to Parquet for
> > > storage in S3, Parquet needs a type. So, somebody knows the type. We just
> > > have to let Drill in on the secret.
> > >
> > >
> > > Would some kind of hint-based model work for your use case? Some idea
> > > that would be better?
> > >
> > >
> > > Thanks,
> > > - Paul
> > >
> > > [1] http://bsonspec.org/spec.html
> > >
> > >
> > >
> > >
> > > On Tuesday, February 25, 2020, 10:17:56 PM PST, Dobes Vandermeer wrote:
> > >
> > > Hi Paul,
> > >
> > > It seems to me that the type of the columns in a JSON file is "JSON" -
> > > e.g. map, array, number, string, null, or boolean. In mongodb it is
> > > "BSON", which adds dates, integers, and a few other things.
> > >
> > > Lacking further guidance from the user, I would expect drill to handle
> > > all JSON & BSON columns as if they could hold any of those types at any
> > > time. It definitely should not distinguish between integers and floats
> > > in JSON, because JSON does not have this distinction.
> > >
> > > I suppose this may seem like a pain, though; perhaps it blows up the
> > > algorithms drill uses. I'm still new to drill so I don't really
> > > understand all the implications of it. But I do know that this *is* the
> > > true data model of JSON & BSON. Trying to lock down the schema will
> > > create impedance mismatches.
> > >
> > > Unless this reality is accepted then the pain will never end, I suspect.
> > >
> > >
> > > If the user does a CAST() on some values then the output of the CAST
> > > operation can be assumed to be the specified type, or there will be an error.
> > > Perhaps there's some hope in that direction.
> > >
> > >
> > >
> > >
> > > On 2/25/2020 8:05:37 PM, Paul Rogers wrote:
> > > Hi Dobes,
> > >
> > > You've run into the classic drawback of runtime schema inference: if
> > > Drill never sees a column value in its first sample, then it has no way
> > > to "predict the future" and guess what type will eventually show up. So,
> > > Drill guesses "nullable INT" which turns out to almost always be wrong.
> > >
> > > Some record readers pick the type on the very first row (a sample size of
> > > 1.) The newer JSON reader we're working on uses the first batch (a few
> > > thousand rows) as its sample size.
> > >
> > > Still, if you request "points", the reader is obligated to provide a
> > > column even if it has to make something up. So, it makes up "nullable INT."
> > >
> > > This is the "black swan" problem of inductive reasoning: no matter how
> > > many empty values Drill sees, there could always be a non-empty value of
> > > some other type.
> > >
> > >
> > > Worse, one scan may see no value and choose "nullable INT" while another
> > > sees the actual value and chooses Float8. Now, some poor exchange
> > > receiver operator will see both types and have no clue what to do.
> > >
> > >
> > > This is why most DBs require a metastore (AKA data dictionary) to provide
> > > table descriptions. Instead of inferring types, DBs define the types,
> > > often via the same spec that drives the generative process that created
> > > the data.
> > >
> > >
> > > Drill also has a relatively new "provided schema" feature that helps with
> > > this issue in some (but not all) format plugins. But, it has not yet been
> > > added to Mongo (or any other storage plugin other than the file system
> > > plugin.)
> > >
> > > You could try a conditional cast: something like
> > >
> > > IF(sqlTypeOf(points) = `INT`, CAST(NULL AS FLOAT4), points)
> > >
> > > (I probably have the syntax a bit wrong.) This works if two different
> > > scans see the different types. But, it will fail if a single scan sees an
> > > empty value followed by an actual value (which is exactly the case you
> > > describe) because the scan is trying to cope with the data before it's
> > > even gotten to the Project operator where the IF would be applied.
> > >
> > > Sorry for the long post, but this is a difficult issue that has
> > > frustrated users for years. I recently posted a proposed solution design
> > > at [1] and would welcome feedback.
> > >
> > >
> > > Thanks,
> > > - Paul
> > >
> > > [1]
> > > https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model
> > >
> > >
> > > On Tuesday, February 25, 2020, 5:27:01 PM PST, Dobes Vandermeer wrote:
> > >
> > > Hi,
> > >
> > >
> > > I was experimenting with the mongo storage system and I found that when I
> > > query a field that doesn't usually have any value, I get this error "You
> > > tried to write a Float8 type when you are using a ValueWriter of type
> > > NullableIntWriterImpl."
> > >
> > > Based on a bit of googling I found that this means drill has inferred the
> > > incorrect type for that field. I was hoping I could override the
> > > inferred type using CAST or something, but CAST didn't work. Is there a
> > > way to tell drill what type a field from mongodb is supposed to be?
> > >
> > > Example query:
> > >
> > > SELECT _id, CAST(points AS DOUBLE)
> > > FROM mongo.formative.answers AS answer
> > > WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
> > > LIMIT 100
> > >
> > > In this case "points" isn't set on every row, so I guess drill assumes it
> > > is "NullableInt" when really it is should be considered a double. We
> > > also have many boolean fields that are not set by default that we would
> > > want to query.
> > >
> > > What's the standard workaround for this case?