What Matt laid out here is accurate. QueryFlowFile currently operates only on the 'root level' fields. I definitely want to have a mechanism for querying embedded fields as well, but right now it's not really possible to do so directly.
As Matt pointed out, you can certainly use UpdateRecord to extract those fields of interest to a higher level. Or an alternative would be to convert the Avro data into JSON and use the JsonPathReader. But of course moving away from having to convert into JSON is one of the goals of the record-oriented paradigm. When JsonPathReader was written, though, we hadn't yet developed the RecordPath DSL so we didn't build a more generic RecordPathReader. While I want to enable QueryRecord to handle these embedded fields, it's really a matter of how best to design it. There are a few options. Firstly, we could let QueryRecord's dynamic properties either be SQL or RecordPath and let the RecordPath's extract fields to query. This may be a bit confusing though because describing exactly what the dynamic properties are and how they are used becomes a bit confusing. An alternate approach would be to develop a RecordPathReader that is configured with another RecordReader (an Avro reader in this case) and then configured with 1 more RecordPath's that extract the fields of interest, and this Record that has the fields of interest as root-level fields would be what you'd configured QueryRecord with. A third option may be to look at the Calcite library and see if they provide any sort of mechanism for identifying which columns are being referenced in the projections, and then try compiling those as RecordPath's and using them when creating the Schema. This would perhaps be the cleanest solution in terms of usability but would mean that: (a) other processors may still have to do a lot of work to do something similar, and (b) it may end up being quite difficult and inefficient to do this, and I'm not sure if it's even possible currently. So I said all that to say this: it's certainly something that we want to do, that is currently lacking. But given the amount of interest that we've seen in the record-oriented processors, this will likely be a hot focus area for development. If you have suggestions or are interested in delving in here, please let me know and I'd be happy to discuss further, review PR's, etc. Thanks! -Mark > On Jul 3, 2017, at 10:59 PM, Matt Burgess <[email protected]> wrote: > > Andre, > > I don't believe that QueryRecord supports RecordPath expressions as > "column names" (but someone should correct me if I'm mistaken :) I > think it takes a tabular view of the record, meaning each top-level > field is a "column" for the purposes of querying. > > For JSON, the JsonPathReader lets you "hoist" fields from a nested > record into what looks like top-level fields from the record API > perspective. For Avro, there is no current AvroPathReader, which IIRC > is partly due to the fact that there is no AvroPath implementation in > the Avro project [1]. However the Jira case mentions an > Apache-licensed library for doing such things, and certainly > RecordPathCompiler handles it from the Record API side. I think the > kicker is how to expose fields for projections (i.e. SELECTs); for > arbitrarily nested records, that's a lot of "columns" and/or > non-standard SQL. It can be done (Apache Drill does); but from what I > remember, Drill has some crafted Calcite voodoo, where QueryRecord > leverages Calcite as a library. > > A workaround could use UpdateRecord to manually hoist the field to the > top-level by adding a new field to the outgoing schema at the > top-level, and in UpdateRecord setting its value to the evaluation of > your RecordPath "/header/deviceEventClassId". Then you could use > QueryRecord on that "column". > > Regards, > Matt > > [1] https://issues.apache.org/jira/browse/AVRO-992 > > On Mon, Jul 3, 2017 at 10:07 PM, Andre <[email protected]> wrote: >> All, >> >> What is the expected to refer to nested avro element from within QueryRecord >> SQL queries? >> >> When using LookupRecord I may refer to a record child element by the >> RecordPath "/header/deviceEventClassId", however, it is not clear what >> syntax should I use with QueryRecord? >> >> SELECT * FROM FLOWFILE WHERE "/header/deviceEventClassId" <> "1" >> >> Fails with column not found.
