Thank you, Matt. Let me better explain why I think I need a different
approach.

I have a top-level directory, TOP.
Under TOP I have YYYY directories: 2017, 2016, 2015, 2014, ....
Under each of those I have twelve MM subdirectories, 01 - 12.
Under each MM I have a subdirectory for each DD.

I find that I cannot point ListFile at TOP or even at any YYYY level: it
takes too long to scan through all the files each processing cycle to
identify those that are new or updated.

I need to develop a workflow that works back in time starting from the
month prior to the current one, runs a ListFile-like scan for that month,
and feeds any newly appearing or updated files to FetchFile. When it is
done with a month, it needs to step back to the month immediately before
it and repeat. It needs to repeat the cycle until there are no more
YYYY/MMs in the directory tree.
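
To make the loop concrete, here is a minimal Python sketch of the traversal
described above. It is only an illustration under assumptions: the function
names months_backwards and files_newer_than are hypothetical, the layout is
assumed to be TOP/YYYY/MM with numeric directory names, and "new or updated"
is approximated by file modification time compared against a stored
timestamp. It uses plain os calls and is not tied to any NiFi API.

```python
import os
from datetime import date


def months_backwards(top, today=None):
    """Yield existing TOP/YYYY/MM directories, newest first,
    starting with the month before the current one."""
    today = today or date.today()
    current = (today.year, today.month)
    found = []
    for yyyy in os.listdir(top):
        year_dir = os.path.join(top, yyyy)
        if not (len(yyyy) == 4 and yyyy.isdigit() and os.path.isdir(year_dir)):
            continue
        for mm in os.listdir(year_dir):
            month_dir = os.path.join(year_dir, mm)
            if not (len(mm) == 2 and mm.isdigit() and os.path.isdir(month_dir)):
                continue
            key = (int(yyyy), int(mm))
            # Skip the current month (and anything later); it is
            # handled by the separate current-day flow.
            if 1 <= key[1] <= 12 and key < current:
                found.append((key, month_dir))
    for _key, month_dir in sorted(found, reverse=True):
        yield month_dir


def files_newer_than(month_dir, last_seen):
    """Yield files under month_dir (including its DD subdirectories)
    whose modification time is later than last_seen (epoch seconds)."""
    for root, _dirs, names in os.walk(month_dir):
        for name in names:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_seen:
                yield path
```

Each yielded month directory would be scanned in turn, and the loop ends
naturally when the list of existing YYYY/MM directories is exhausted.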

It doesn't appear to me that ListFile can be used inside an iterating
workflow like this because it accepts no incoming connection from a prior
processor. So I would not be able to provide it an expression language
expression for its Input Dir that would incorporate attributes I would set
at the end of each loop (nextYYYY, nextMM).
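
For illustration, the values I have in mind for nextYYYY and nextMM could be
computed as in the Python sketch below. The function name previous_month is
hypothetical, and how the result would be written back to flowfile
attributes is left out.

```python
def previous_month(yyyy, mm):
    """Return (nextYYYY, nextMM) as zero-padded strings for the month
    immediately before the given YYYY/MM."""
    year, month = int(yyyy), int(mm)
    if month == 1:
        # January rolls back to December of the previous year.
        year, month = year - 1, 12
    else:
        month -= 1
    return "%04d" % year, "%02d" % month
```

For example, previous_month("2017", "01") gives ("2016", "12").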

Am I mistaken? Is there a way to feed ListFile attributes I set and
reference in the dynamic expression for Input Dir?

(You're probably asking why work backwards? The probability of updates is
highest in the "newest" legacy content, and the customer wants to see
those as quickly as possible. As we move back through time, updates begin
to trail off in frequency.)

Handling content that arrives in the current day is easy for ListFile. I
will handle that in a separate, independent workflow. I simply set its
Input Dir to the expression
/TOP/${now():format('yyyy'):toString()}/${now():format('MM'):toString()}/${now():format('dd'):toString()}
so that it adjusts automagically through the expression to always look
at the current day.
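
For comparison, the same current-day path can be built in plain Python. This
is just a sketch of what the expression evaluates to; the function name
current_day_dir and the /TOP default are illustrative assumptions.

```python
from datetime import date


def current_day_dir(top="/TOP", today=None):
    """Build TOP/YYYY/MM/DD for the given day (default: today)."""
    today = today or date.today()
    return "%s/%s" % (top, today.strftime("%Y/%m/%d"))
```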

But working backwards is another matter.

Jim

On Fri, Oct 20, 2017 at 10:30 AM, Matt Burgess <[email protected]> wrote:

> I have an example (albeit a trivial one) of this in my ExecuteScript
> Cookbook post [1].  As far as a separate workaround, I can't tell from
> the description what you need to do differently than ListFile.  It
> starts with no state, lists all the files, saves the time of the
> newest file in state, then only sends files in that directory with a
> timestamp later than the timestamp in state.  Are you trying to store
> the current time in state, versus the time of the newest file?
>
> Regards,
> Matt
>
> [1] https://community.hortonworks.com/articles/77739/executescript-cookbook-part-3.html
>
> On Fri, Oct 20, 2017 at 10:14 AM, James McMahon <[email protected]>
> wrote:
> > Does anyone have an example where setState and getState are called
> > from a python script? I need to run the script in an ExecuteScript
> > processor, initializing a datetime to some initial zero condition as
> > ListFile might do, saving the time I do a listing of files to state,
> > and recalling that value when I do subsequent iterations through
> > ExecuteScript.
> >
> > My Execute Script will create a JSON object that is a list of all
> > files in directory with DT stamp later than the value in state. I'll
> > pass that result to SplitJSON. My intention is for the resulting
> > flowfiles to be used by FetchFile.
> >
> > Normally I would let ListFile do this for me, but in my situation I
> > must iterate, and as far as I can see ListFile allows no input
> > connections. I can't see how it can be used in any iterative fashion.
> > Am trying to come up with a reasonable workaround.
> >
> >
>
