Re: Concept of async event-driven processors but not the experimental scheduling policy version

Mark Petronic Sun, 29 Nov 2015 09:12:49 -0800

Mark, I was thinking exactly this same thought about adding a dir.created
type attribute to key off and thereby continuing to work within the current
framework. Thanks for your thoughts.
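
To make the attribute-based idea concrete, here is a minimal sketch of the flow as a plain-Python simulation. It is not NiFi code: FlowFile, FakePutHDFS, and route_on_attribute are hypothetical stand-ins, and "hdfs.directory.created" is only the attribute name floated in this thread, not something PutHDFS is known to set today.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi FlowFile: a name plus attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


class FakePutHDFS:
    """Hypothetical stand-in for PutHDFS that tracks known directories."""

    def __init__(self):
        self.existing_dirs = set()

    def put(self, flowfile, directory):
        # Only the writer knows whether it had to create the directory.
        created = directory not in self.existing_dirs
        self.existing_dirs.add(directory)
        # Assumed attribute name, taken from the suggestion in this thread.
        flowfile.attributes["hdfs.directory.created"] = "true" if created else "false"
        return flowfile


def route_on_attribute(flowfile):
    """Stand-in for RouteOnAttribute: return a relationship name."""
    if flowfile.attributes.get("hdfs.directory.created") == "true":
        return "dir.created"
    return "unmatched"


put_hdfs = FakePutHDFS()
directory = "/stats/year=2015/month=11/day=28"
dirs_needing_partition = []

# Three files land in the same directory; only the first one creates it.
for i in range(1, 4):
    ff = put_hdfs.put(FlowFile(f"the_stats_file_{i:06d}.csv"), directory)
    if route_on_attribute(ff) == "dir.created":
        dirs_needing_partition.append(directory)
```

With this shape, every FlowFile still goes to 'success', and only the one that triggered directory creation is routed on to whatever adds the Hive partition.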


On Sun, Nov 29, 2015 at 10:51 AM, Mark Payne <[email protected]> wrote:

> Mark,
>
> I can't say that I've ever really given thought to an explicit "Eventing
> Model" like the one that you are describing.
> However, the way that you are describing it is really just a new
> relationship on the PutHDFS processor, so it would
> be a very processor-specific change.
>
> Rather than adding a new "Event" type of relationship, though, I would
> lean more toward creating an attribute on the existing
> FlowFile that is routed to 'success'. So an attribute named, say,
> "hdfs.directory.created" could be added to the FlowFile
> and if you care about that information, you can route the FlowFiles to a
> RouteOnAttribute processor, which
> is able to route the FlowFile accordingly.
>
> Does this give you what you need?
>
> Thanks
> -Mark
>
>
>
> > On Nov 29, 2015, at 9:36 AM, Mark Petronic <[email protected]> wrote:
> >
> > I know there is an experimental event-driven scheduling policy. This is
> not that. Has anyone considered a pattern where processors might emit
> events based on certain criteria and other processors might ONLY act on
> those events? I'm just thinking out loud on this thought at the moment and
> just wanted to see if anyone else had pondered this concept. Here's my use
> case. Consider the RouteText processor feeding into a PutHDFS. RouteText is
> grouping records on yyyymmdd values using a regex because I want to
> partition files into HDFS directories by yyyymmdd and then use Hive to
> query the data. PutHDFS simply uses the RouteText.group attribute to create
> the year/month/day HDFS directory structure like:
> >
> > /stats/year=2015/month=11/day=28/the_stats_file_000001.csv
> >
> > However, I need to ALSO run a Hive HQL command to "alter table X add
> partition Y" to allow Hive to see this new partition of data. So, the
> "event" part of this concept would be some way to instruct PutHDFS to emit
> an event ONLY when it actually creates a new directory. There could be an
> "event" relationship that could feed some other processor, like ExecuteSQL,
> that would then add the partition ONLY when this event occurs. It would NOT
> act on any flowfiles - only on events. There will be lots of files being
> put that will fall into an ALREADY existing directory since PutHDFS only
> has to create the directory structure ONCE. I only need to know about that
> ONE event so as to run the HQL command ONCE. I know there are ways to
> "wire" this up using existing processors, like use ExecuteStreamCommand to
> run a script that checks if the directory exists, and, if not, create it
> and run the SQL processor to run a SQL command against Hive to build the
> partition and then let PutHDFS do its thing. But that means running this
> script on EVERY flow file, which is a waste of resources. Only PutHDFS
> really knows when it needs to create the directory ONCE. I was just
> wondering if there was any thought of building in some async event handling?
> >
> > Anyway, just an idea.
> >
> > Mark
>
>
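
For reference, the partition directory and the matching Hive statement described in the original message can both be derived from the yyyymmdd group value. The following is a rough sketch under stated assumptions: the table name "stats", string-typed partition columns named year, month, and day, and the helper function names are all hypothetical and do not come from the thread.

```python
import re


def partition_from_group(group):
    """Split a yyyymmdd string into year, month, and day parts (kept as strings)."""
    match = re.fullmatch(r"(\d{4})(\d{2})(\d{2})", group)
    if match is None:
        raise ValueError(f"expected yyyymmdd, got {group!r}")
    year, month, day = match.groups()
    return {"year": year, "month": month, "day": day}


def hdfs_directory(base, part):
    """Build the key=value directory layout shown in the original message."""
    return f"{base}/year={part['year']}/month={part['month']}/day={part['day']}"


def add_partition_hql(table, base, part):
    """Build an ALTER TABLE ... ADD PARTITION statement for one directory.

    The table name and string-typed partition columns are assumptions.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in part.items())
    location = hdfs_directory(base, part)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


part = partition_from_group("20151128")
directory = hdfs_directory("/stats", part)
statement = add_partition_hql("stats", "/stats", part)
```

Using IF NOT EXISTS makes the statement safe to repeat, so a duplicate "directory created" signal would not cause an error.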
