I know there is an experimental event-driven scheduling policy. This is not that. Has anyone considered a pattern where processors could emit events based on certain criteria, and other processors would act ONLY on those events? I'm just thinking out loud here and wanted to see if anyone else has pondered this concept.

Here's my use case. Consider a RouteText processor feeding into PutHDFS. RouteText is grouping records on yyyymmdd values using a regex, because I want to partition files into HDFS directories by yyyymmdd and then use Hive to query the data. PutHDFS simply uses the RouteText.group attribute to create the year/month/day HDFS directory structure, like:
/stats/year=2015/month=11/day=28/the_stats_file_000001.csv

However, I ALSO need to run a Hive HQL command, "alter table X add partition Y", so that Hive can see this new partition of data. The "event" part of this concept would be some way to instruct PutHDFS to emit an event ONLY when it actually creates a new directory. There could be an "event" relationship that feeds some other processor, like ExecuteSQL, which would then add the partition ONLY when this event occurs. It would NOT act on any flowfiles - only on events. Lots of the files being put will fall into an ALREADY existing directory, since PutHDFS only has to create the directory structure ONCE. I only need to know about that ONE event so I can run the HQL command ONCE.

I know there are ways to "wire" this up using existing processors: for example, use ExecuteStreamCommand to run a script that checks whether the directory exists and, if not, creates it and runs a SQL command against Hive to build the partition, and then let PutHDFS do its thing. But that means running the script on EVERY flowfile, which is a waste of resources. Only PutHDFS really knows when it needs to create the directory ONCE.

I was just wondering if there has been any thought of building in some async event handling? Anyway, just an idea.

Mark
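P.S. For concreteness, the ExecuteStreamCommand workaround I mentioned might look something like the sketch below. The table name `stats`, the helper function names, and the `hive -e` invocation are my own assumptions for illustration, not an actual implementation - and note this check would still run once per flowfile, which is exactly the waste I'd like to avoid:

```shell
#!/bin/sh
# Sketch of a check-then-create script invoked by ExecuteStreamCommand
# for each flowfile, passing the year/month/day from the RouteText group.
# Usage: ensure_partition <year> <month> <day>

# Build the HDFS directory path for a given date (matches the layout above).
partition_dir() {
  echo "/stats/year=$1/month=$2/day=$3"
}

# Build the HQL that registers the partition with Hive.
# "stats" is a placeholder table name.
add_partition_hql() {
  echo "ALTER TABLE stats ADD IF NOT EXISTS PARTITION (year=$1, month=$2, day=$3) LOCATION '$(partition_dir "$1" "$2" "$3")'"
}

# Create the directory and register the Hive partition only when the
# directory does not already exist.
ensure_partition() {
  DIR="$(partition_dir "$1" "$2" "$3")"
  # `hdfs dfs -test -d` exits 0 if the directory already exists
  if ! hdfs dfs -test -d "$DIR"; then
    hdfs dfs -mkdir -p "$DIR"
    hive -e "$(add_partition_hql "$1" "$2" "$3")"
  fi
}
```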
