Mark, I was thinking along exactly the same lines: adding a dir.created-type attribute to key off, and thereby continuing to work within the current framework. Thanks for your thoughts.
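
To make the attribute idea concrete, here is a minimal sketch in plain Python. It is not NiFi code and does not use any NiFi or HDFS API. The local filesystem stands in for HDFS, and the helper names (put_file, add_partition_hql, route) and the table name "stats" are hypothetical, chosen only for illustration. It assumes the attribute carries the string "true" or "false", and that the partition columns accept quoted values.

```python
import os
import tempfile


def put_file(root, rel_dir, name, data):
    """Write data under root/rel_dir and return FlowFile-style attributes.

    The local filesystem is a stand-in for HDFS. The check-then-create
    below is not atomic, which is acceptable for a sketch only.
    """
    target_dir = os.path.join(root, rel_dir)
    created = not os.path.isdir(target_dir)  # True only on first write
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, name), "w") as fh:
        fh.write(data)
    return {
        "path": rel_dir,
        "filename": name,
        "hdfs.directory.created": str(created).lower(),
    }


def add_partition_hql(table, rel_dir):
    """Build the Hive statement from key=value path segments."""
    pairs = [seg.split("=", 1) for seg in rel_dir.split("/") if "=" in seg]
    spec = ", ".join(f"{key}='{value}'" for key, value in pairs)
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec})"


def route(attrs, table):
    """Mimic routing on the attribute: return HQL only for new directories."""
    if attrs.get("hdfs.directory.created") == "true":
        return add_partition_hql(table, attrs["path"])
    return None


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    day = "stats/year=2015/month=11/day=28"
    for filename in ("the_stats_file_000001.csv", "the_stats_file_000002.csv"):
        attrs = put_file(root, day, filename, "a,b,c\n")
        print(filename, "->", route(attrs, "stats"))
```

Only the first file written into a given day directory yields a statement. Every later file in that directory routes to nothing, so the HQL runs once per partition rather than once per flow file.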

On Sun, Nov 29, 2015 at 10:51 AM, Mark Payne <[email protected]> wrote:

> Mark,
>
> I can't say that I've ever really given thought to an explicit "Eventing
> Model" like the one that you are describing.
> However, the way that you are describing it is really just a new
> relationship on the PutHDFS processor, so it would
> be a very processor-specific change.
>
> Rather than adding a new "Event" type of relationship, though, I would
> lean more toward creating an attribute on the existing
> FlowFile that is routed to 'success'. So an attribute named, say,
> "hdfs.directory.created" could be added to the FlowFile
> and if you care about that information, you can route the FlowFiles to a
> RouteOnAttribute processor, which
> is able to route the FlowFile accordingly.
>
> Does this give you what you need?
>
> Thanks
> -Mark
>
>
> > On Nov 29, 2015, at 9:36 AM, Mark Petronic <[email protected]> wrote:
> >
> > I know there is an experimental event-driven scheduling policy. This is
> > not that. Has anyone considered a pattern where processors might emit
> > events based on certain criteria and other processors might ONLY act on
> > those events? I'm just thinking out loud on this thought at the moment and
> > just wanted to see if anyone else had pondered this concept. Here's my use
> > case. Consider the RouteText processor feeding into a PutHDFS. RouteText is
> > grouping records on yyyymmdd values using a regex because I want to
> > partition files into HDFS directories by yyyymmdd and then use Hive to
> > query the data. PutHDFS simply uses the RouteText.group attribute to create
> > the year/month/day HDFS directory structure like:
> >
> > /stats/year=2015/month=11/day=28/the_stats_file_000001.csv
> >
> > However, I need to ALSO run a Hive HQL command to "alter table X add
> > partition Y" to allow Hive to see this new partition of data. So, the
> > "event" part of this concept would be some way to instruct PutHDFS to emit
> > an event ONLY when it actually creates a new directory. There could be an
> > "event" relationship that could feed some other processor, like ExecuteSQL,
> > that would then add the partition ONLY when this event occurs. It would NOT
> > act on any flowfiles - only on events. There will be lots of files being
> > put that will fall into an ALREADY existing directory since PutHDFS only
> > has to create the directory structure ONCE. I only need to know about that
> > ONE event so as to run the HQL command ONCE. I know there are ways to
> > "wire" this up using existing processors, like use ExecuteStreamCommand to
> > run a script that checks if the directory exists, and, if not, create it
> > and run the SQL processor to run a SQL command against Hive to build the
> > partition and then let PutHDFS do its thing. But that means running this
> > script on EVERY flow file, which is a waste of resources. Only PutHDFS
> > really knows when it needs to create the directory ONCE. I was just
> > wondering if there was any thought of building in some async event handling?
> >
> > Anyway, just an idea.
> >
> > Mark
