Re: ExecuteSQL just once

Matt Burgess Mon, 31 Oct 2016 08:19:23 -0700

Daniel,

One thing you can do (albeit manually) for a "run once" is to schedule
ExecuteSQL for some large interval (3600 sec for example), then start
and immediately stop the processor. It will run once (stopping the
processor doesn't interrupt the current execution, it just prevents it
from being scheduled again). The Jira Bryan mentioned is [1], if
implemented you could schedule a processor to run N times, and it
won't be scheduled after that number of executions has occurred.


Back to the custom query, can you provide an example of such a query
that you'd like to use? What if we were to add the ability to
QueryDatabaseTable and GenerateTableFetch where you can add WHERE
clauses, GROUP BYs, ORDER BYs, etc.?  We avoided the ability for the
user to provide a custom query, because we'd have to parse the
arbitrary SQL, insert our own clauses, and offer a way for the user to
access state properties (possibly with Expression Language but they'd
be "transient attributes", only available to that processor). It gets
a little complex, but there is precedence with UpdateAttribute [2].

There are a couple of workarounds. For example you could do the
inverse of trying to provide a custom query for the DB processor(s),
where you let GenerateTableFetch generate SQL queries, then use
ReplaceText to update them inline with your clauses. Alternatively,
the Column Names property is used verbatim in the SELECT clause, so
you could put a more complex clause in there (maybe a nested select
ending with AS so the generated SQL makes it look like a table alias
or something).

It's definitely a place where we can improve the user experience
and/or feature set, I'm interested to hear your thoughts on this and
anything else :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-1407
[2] https://issues.apache.org/jira/browse/NIFI-1582

On Sun, Oct 30, 2016 at 8:03 PM, Daniel Einspanjer
<daniel.einspan...@designet.com> wrote:
> Unfortunately, neither of those allow you to specify a custom query, so if
> you are doing anything other than a select *, you have a problem.
>
> The use case I was working with was a "getting started" kind of thing.
> Slurp some data in from a DB and try to work with it in various flows.
> Unfortunately, my laptop got rather unhappy when I accidentally left the
> schedule for ExecuteSQL set to 0 seconds and ran a large query.  The step
> spewed out several repetitions of that query which bogged things down and
> prevented me from being able to stop other processors and clear queues
> because the downstream processors had already started processing the first
> of the large flow files.
>
> I'd say probably something that would make a convenient "run once unless
> this flag is cleared" type setting would have been what I was looking for.
> I admit it isn't quite as likely to be needed in a normal production
> environment though, just development.
>
> -Daniel
>
> On Sun, Oct 30, 2016 at 7:55 PM Matt Burgess <mattyb...@gmail.com> wrote:
>>
>> We have QueryDatabaseTable and GenerateTableFetch for such things,
>> especially the former for 1.0.0. QueryDatabaseTable allows you to pick
>> "thedate" as a max-value column that it keeps track of, and you can specify
>> the initial maximum value (not sure what version that was added in as I'm
>> AFK).
>>
>> Sent from my iPhone
>>
>> On Oct 30, 2016, at 7:11 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>
>> Daniel
>>
>> We don't have mechanisms provided for one time execution at this point.
>> Everything is oriented toward continuous execution.  That said this is
>> starting to come up more and more so seems like we should offer run once
>> scheduling or something like that.
>>
>> Would that give you what you are looking for?  Can you talk about your use
>> case a bit more ?
>>
>> Thanks
>> Joe
>>
>> On Oct 30, 2016 7:07 PM, "Daniel Einspanjer"
>> <daniel.einspan...@designet.com> wrote:
>> >
>> > If I put a SQL statement into an ExecuteSQL processor along the lines
>> > of:
>> >
>> > SELECT * FROM foo WHERE thedate >= '2016-01-01'
>> >
>> > and I build a flow that does something with that data, I'f found if I
>> > leave the ExecuteSQL processor running, it will continually run the query,
>> > resulting in lots of duplicate data.
>> >
>> > What is the proper way to handle something that would be sort of a one
>> > time event?
>> >
>> > Right now, I'm just setting a long run schedule time and making sure I
>> > turn the step off after it finishes the first run.
>> >
>> > -Daniel

Re: ExecuteSQL just once

Reply via email to