Boris et al, I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord under NIFI-4517, in case anyone wants to play around with it :)

Regards,
Matt

[1] https://github.com/apache/nifi/pull/2945
On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>
> Matt, you rock!! thank you!!
>
> On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <mattyb...@gmail.com> wrote:
>>
>> Sounds good. It makes the underlying code a bit more complicated, but I see from y'all's points that a "separate" processor is a better user experience. I'm knee-deep in it as we speak; hope to have a PR up in a few days.
>>
>> Thanks,
>> Matt
>>
>> On Aug 7, 2018, at 5:07 PM, Andrew Grande <apere...@gmail.com> wrote:
>>
>> I'd really like to see the Record suffix on the processor for discoverability, as already mentioned.
>>
>> Andrew
>>
>> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <mattyb...@apache.org> wrote:
>>>
>>> Yeah, that's definitely doable. Most of the logic for writing a ResultSet to a flow file is localized (currently to JdbcCommon, but also in ResultSetRecordSet), so I wouldn't think it would be too much of a refactor. What are folks' thoughts on whether to add a Record Writer property to the existing ExecuteSQL, or to subclass it into a new processor called ExecuteSQLRecord? The former is more consistent with how the SiteToSite reporting tasks work, but this is a processor. The latter is more consistent with the way we've done other record processors, and the benefit there is that we don't have to add a bunch of documentation to fields that will be ignored (such as the Use Avro Logical Types property, which we wouldn't need in an ExecuteSQLRecord). Having said that, we will want to offer the same options in the Avro Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>>>
>>> Thanks,
>>> Matt
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>>>
>>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <alopre...@apache.org> wrote:
>>> >
>>> > Matt,
>>> >
>>> > Would extending the core ExecuteSQL processor with an ExecuteSQLRecord processor also work? I wonder about discoverability if only one processor is present, and in other places we explicitly name the processors which handle records as such. If the ExecuteSQL processor handled all the SQL logic, and the ExecuteSQLRecord processor just delegated most of the processing in its #onTrigger() method to super, do you foresee any substantial difficulties? It might require some refactoring of the parent #onTrigger() into service methods.
>>> >
>>> > Andy LoPresto
>>> > alopre...@apache.org
>>> > alopresto.apa...@gmail.com
>>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
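A rough sketch of the subclassing shape Andy describes: the parent keeps the SQL logic and exposes the output step as a protected hook, and the Record variant overrides only that hook. The types and names here are placeholders standing in for the real NiFi API, not the actual ExecuteSQL internals:

    import java.sql.ResultSet;

    // Placeholder stand-ins for the NiFi API, only to keep the sketch
    // self-contained; the real signatures differ.
    interface ProcessContext {}
    interface ProcessSession {}

    // Parent keeps all the SQL logic; the output step is a protected hook.
    class ExecuteSQLSketch {
        public void onTrigger(ProcessContext context, ProcessSession session) {
            ResultSet results = runQuery(context);    // shared: connection, statement, fetch
            writeResults(results, context, session);  // overridable: how rows become flow files
        }

        protected ResultSet runQuery(ProcessContext context) {
            throw new UnsupportedOperationException("sketch only");
        }

        // Default output: Avro with embedded schema, as ExecuteSQL does today
        // (the logic currently in JdbcCommon would live behind this hook).
        protected void writeResults(ResultSet rs, ProcessContext context, ProcessSession session) {
        }
    }

    // The Record variant delegates everything to super except the output step.
    class ExecuteSQLRecordSketch extends ExecuteSQLSketch {
        @Override
        protected void writeResults(ResultSet rs, ProcessContext context, ProcessSession session) {
            // Hand the ResultSet to the configured Record Writer instead
            // (e.g. via ResultSetRecordSet), so JSON, CSV, etc. come for free.
        }
    }

With this shape, the refactoring Andy anticipates amounts to extracting runQuery() and writeResults() out of the parent's #onTrigger(); bug fixes to the SQL logic then land in one place for both processors.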
>>> >
>>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <apere...@gmail.com> wrote:
>>> >
>>> > As a side note, one has to have a serious justification _not_ to use record-based processors. The benefits, including performance, are too numerous to call out here.
>>> >
>>> > Andrew
>>> >
>>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <marka...@hotmail.com> wrote:
>>> >>
>>> >> Boris,
>>> >>
>>> >> Using a record-based processor does not mean that you need to define a schema upfront. That is only necessary if the source itself cannot provide a schema. Since this processor is pulling structured data and the schema can be inferred from the database, you wouldn't need to. As Matt was saying, your Record Writer can simply be configured to Inherit Record Schema. It can then write the schema to the "avro.schema" attribute, or you can choose "Do Not Write Schema". This would still allow the data to be written as JSON, CSV, etc.
>>> >>
>>> >> You could also have the Record Writer write the schema to the "avro.schema" attribute, as mentioned above, and then have any downstream processors read the schema from that attribute. This would let you use any record-oriented processors you'd like without having to define the schema yourself, if you don't want to.
>>> >>
>>> >> Thanks
>>> >> -Mark
>>> >>
>>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>>> >>
>>> >> Thanks for all the responses! It means I am not the only one interested in this topic.
>>> >>
>>> >> A record-aware version would be really nice, but a lot of the time I do not want to use record-based processors, since I would need to define a schema for input/output upfront when I just want to run a SQL query and get whatever results back. It adds an extra step that is subject to breakage and support.
>>> >>
>>> >> Similar to the Kafka processors, it is nice to have the option of a record-based processor vs. a message-oriented processor. But if one processor can do it all, that is even better :)
>>> >>
>>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb...@apache.org> wrote:
>>> >>>
>>> >>> I'm definitely interested in supporting a record-aware version as well (I wrote the Jira up last year [1] but haven't gotten around to implementing it); however, I agree with Peter's comment on the Jira. Since ExecuteSQL is an oft-touched processor, if we had two processors that only differed in how the output is formatted, it could be harder to maintain (bugs to be fixed in two places, for example). I think we should add an optional Record Writer property to ExecuteSQL, and the documentation would reflect that if it is not set, the output will be Avro with an embedded schema, as it has always been. If the Record Writer is set, either the schema can be hardcoded, or users can choose "Inherit Record Schema" even though there is no reader, which would mimic the current behavior where the schema is inferred from the database columns and used for the writer. There is precedent for this pattern in the SiteToSite reporting tasks.
>>> >>>
>>> >>> To Bryan's point about history, Avro at the time was the most descriptive of the solutions because it maintains the schema and datatypes with the data, unlike JSON, CSV, etc. Also, before the record readers/writers existed, as Bryan said, you pretty much had to split, transform, merge. We just need to make that processor (and others with specific input/output formats) "record-aware" for better performance.
>>> >>>
>>> >>> Regards,
>>> >>> Matt
>>> >>>
>>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>>> >>>
>>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbe...@gmail.com> wrote:
>>> >>> >
>>> >>> > I would also add that the pattern of splitting to 1 record per flow file was common before the record processors existed. Generally this can and should be avoided now, in favor of processing and manipulating records in place and keeping them together in large batches.
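The single-processor alternative Matt proposes reduces to one branch in the output step: no writer configured means the legacy Avro path, a configured writer means inferring the schema from the ResultSet metadata and handing rows to it. A condensed sketch, again with hypothetical placeholder types rather than the real NiFi API:

    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a configured Record Writer factory (JSON, CSV, ...).
    interface WriterFactorySketch {
        void write(ResultSet rs, List<String> inferredColumns) throws SQLException;
    }

    class OptionalWriterSketch {
        private final WriterFactorySketch writerFactory; // null = Record Writer property not set

        OptionalWriterSketch(WriterFactorySketch writerFactory) {
            this.writerFactory = writerFactory;
        }

        void writeResults(ResultSet rs) throws SQLException {
            if (writerFactory == null) {
                // Legacy path: Avro with embedded schema, exactly as today.
                return;
            }
            // "Inherit Record Schema" path: infer the schema from the ResultSet
            // metadata (no Record Reader involved), then hand rows to the writer.
            ResultSetMetaData md = rs.getMetaData();
            List<String> columns = new ArrayList<>();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                columns.add(md.getColumnLabel(i));
            }
            writerFactory.write(rs, columns);
        }
    }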
>>> >>> >
>>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <apere...@gmail.com> wrote:
>>> >>> > > Careful, that makes too much sense, Joe ;)
>>> >>> > >
>>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.w...@gmail.com> wrote:
>>> >>> > >>
>>> >>> > >> I think we just need to make an ExecuteSqlRecord processor.
>>> >>> > >>
>>> >>> > >> thanks
>>> >>> > >>
>>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mikerthom...@gmail.com> wrote:
>>> >>> > >>>
>>> >>> > >>> My guess is that it is because Avro is the only record type that can match SQL closely, feature for feature, on data types.
>>> >>> > >>>
>>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>> >>> > >>>>
>>> >>> > >>>> I've been wondering since I started learning NiFi why the ExecuteSQL processor only returns Avro-formatted data. All the community examples I've seen then convert the Avro to JSON, and pretty much all of them then split the JSON into multiple flow files.
>>> >>> > >>>>
>>> >>> > >>>> I found myself doing the same thing over and over and over again.
>>> >>> > >>>>
>>> >>> > >>>> Since everyone is doing it, is there a strong reason why Avro is liked so much? And why does everyone keep using this three-step pattern, rather than giving users an option to output JSON instead, and another option to output one flow file or multiple (one per record)?
>>> >>> > >>>>
>>> >>> > >>>> thanks
>>> >>> > >>>> Boris
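Mike's point above, that Avro tracks SQL data types closely, shows up in how an Avro schema can be derived directly from JDBC metadata. A minimal sketch using the plain Avro SchemaBuilder API; the type mappings are abbreviated for illustration and are not NiFi's actual JdbcCommon code:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Types;

    public class JdbcAvroSketch {
        // Build an Avro record schema from a query's ResultSetMetaData,
        // mapping each JDBC column type to a nullable Avro field.
        static Schema schemaFor(ResultSetMetaData md) throws SQLException {
            SchemaBuilder.FieldAssembler<Schema> fields =
                    SchemaBuilder.record("ResultRecord").fields();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                String name = md.getColumnLabel(i);
                switch (md.getColumnType(i)) {
                    case Types.INTEGER:
                    case Types.SMALLINT:
                        fields = fields.optionalInt(name);
                        break;
                    case Types.BIGINT:
                        fields = fields.optionalLong(name);
                        break;
                    case Types.DOUBLE:
                    case Types.FLOAT:
                        fields = fields.optionalDouble(name);
                        break;
                    case Types.BOOLEAN:
                        fields = fields.optionalBoolean(name);
                        break;
                    default:
                        fields = fields.optionalString(name); // VARCHAR and the rest
                }
            }
            return fields.endRecord();
        }
    }

Because the schema and types travel with the data, the Avro output needs no out-of-band contract, which is exactly why plain JSON or CSV were weaker defaults before the record readers/writers arrived.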