Re: QueryRecord and AvroReader: What is the expect syntax ?

Mark Payne Wed, 05 Jul 2017 06:05:12 -0700

What Matt laid out here is accurate. QueryFlowFile currently operates only on 
the 'root level' fields.
I definitely want to have a mechanism for querying embedded fields as well, but 
right now it's not really
possible to do so directly.

As Matt pointed out, you can certainly use UpdateRecord to extract those fields 
of interest to a higher
level. Or an alternative would be to convert the Avro data into JSON and use 
the JsonPathReader. But of
course moving away from having to convert into JSON is one of the goals of the 
record-oriented paradigm.
When JsonPathReader was written, though, we hadn't yet developed the RecordPath 
DSL so we didn't build
a more generic RecordPathReader.

While I want to enable QueryRecord to handle these embedded fields, it's really 
a matter of how best to design
it. There are a few options. Firstly, we could let QueryRecord's dynamic 
properties either be SQL or RecordPath
and let the RecordPath's extract fields to query. This may be a bit confusing 
though because describing exactly
what the dynamic properties are and how they are used becomes a bit confusing. 
An alternate approach would be
to develop a RecordPathReader that is configured with another RecordReader (an 
Avro reader in this case) and then
configured with 1 more RecordPath's that extract the fields of interest, and 
this Record that has the fields of interest
as root-level fields would be what you'd configured QueryRecord with.

A third option may be to look at the Calcite library and see if they provide 
any sort of mechanism for identifying which
columns are being referenced in the projections, and then try compiling those 
as RecordPath's and using them when
creating the Schema. This would perhaps be the cleanest solution in terms of 
usability but would mean that:

(a) other processors may still have to do a lot of work to do something 
similar, and
(b) it may end up being quite difficult and inefficient to do this, and I'm not 
sure if it's even possible currently.

So I said all that to say this: it's certainly something that we want to do, 
that is currently lacking. But given the amount of
interest that we've seen in the record-oriented processors, this will likely be 
a hot focus area for development. If you have
suggestions or are interested in delving in here, please let me know and I'd be 
happy to discuss further, review PR's, etc.

Thanks!
-Mark

> On Jul 3, 2017, at 10:59 PM, Matt Burgess <[email protected]> wrote:
> 
> Andre,
> 
> I don't believe that QueryRecord supports RecordPath expressions as
> "column names" (but someone should correct me if I'm mistaken :) I
> think it takes a tabular view of the record, meaning each top-level
> field is a "column" for the purposes of querying.
> 
> For JSON, the JsonPathReader lets you "hoist" fields from a nested
> record into what looks like top-level fields from the record API
> perspective. For Avro, there is no current AvroPathReader, which IIRC
> is partly due to the fact that there is no AvroPath implementation in
> the Avro project [1]. However the Jira case mentions an
> Apache-licensed library for doing such things, and certainly
> RecordPathCompiler handles it from the Record API side. I think the
> kicker is how to expose fields for projections (i.e. SELECTs); for
> arbitrarily nested records, that's a lot of "columns" and/or
> non-standard SQL. It can be done (Apache Drill does); but from what I
> remember, Drill has some crafted Calcite voodoo, where QueryRecord
> leverages Calcite as a library.
> 
> A workaround could use UpdateRecord to manually hoist the field to the
> top-level by adding a new field to the outgoing schema at the
> top-level, and in UpdateRecord setting its value to the evaluation of
> your RecordPath "/header/deviceEventClassId". Then you could use
> QueryRecord on that "column".
> 
> Regards,
> Matt
> 
> [1] https://issues.apache.org/jira/browse/AVRO-992
> 
> On Mon, Jul 3, 2017 at 10:07 PM, Andre <[email protected]> wrote:
>> All,
>> 
>> What is the expected to refer to nested avro element from within QueryRecord
>> SQL queries?
>> 
>> When using LookupRecord I may refer to a record child element by the
>> RecordPath "/header/deviceEventClassId", however, it is not clear what
>> syntax should I use with QueryRecord?
>> 
>> SELECT * FROM FLOWFILE WHERE "/header/deviceEventClassId" <> "1"
>> 
>> Fails with column not found.

Re: QueryRecord and AvroReader: What is the expect syntax ?

Reply via email to