I know there is an experimental event-driven scheduling policy. This is not that. Has anyone considered a pattern where processors could emit events based on certain criteria, and other processors would act ONLY on those events? I'm just thinking out loud here and wanted to see if anyone else has pondered this concept.

Here's my use case. Consider a RouteText processor feeding into PutHDFS. RouteText is grouping records on yyyymmdd values using a regex, because I want to partition files into HDFS directories by yyyymmdd and then use Hive to query the data. PutHDFS simply uses the RouteText.group attribute to create the year/month/day HDFS directory structure, like:
/stats/year=2015/month=11/day=28/the_stats_file_000001.csv

However, I ALSO need to run a Hive HQL command, "alter table X add partition Y", so that Hive can see this new partition of data. The "event" part of this concept would be some way to instruct PutHDFS to emit an event ONLY when it actually creates a new directory. There could be an "event" relationship that feeds some other processor, like ExecuteSQL, which would then add the partition ONLY when this event occurs. It would NOT act on any flowfiles - only on events. Lots of the files being put will fall into an ALREADY existing directory, since PutHDFS only has to create the directory structure ONCE. I only need to know about that ONE event so I can run the HQL command ONCE.

I know there are ways to "wire" this up using existing processors: for example, use ExecuteStreamCommand to run a script that checks whether the directory exists and, if not, creates it and runs a SQL command against Hive to build the partition, and then let PutHDFS do its thing. But that means running the script on EVERY flowfile, which is a waste of resources. Only PutHDFS really knows when it needs to create the directory ONCE.

I was just wondering if there has been any thought of building in some async event handling? Anyway, just an idea.

Mark
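P.S. For concreteness, the ExecuteStreamCommand workaround I mentioned might look something like the sketch below. The table name `stats`, the helper function names, and the `hive -e` invocation are my own assumptions for illustration, not an actual implementation - and note this check would still run once per flowfile, which is exactly the waste I'd like to avoid:

```shell
#!/bin/sh
# Sketch of a check-then-create script invoked by ExecuteStreamCommand
# for each flowfile, passing the year/month/day from the RouteText group.
# Usage: ensure_partition <year> <month> <day>

# Build the HDFS directory path for a given date (matches the layout above).
partition_dir() {
  echo "/stats/year=$1/month=$2/day=$3"
}

# Build the HQL that registers the partition with Hive.
# "stats" is a placeholder table name.
add_partition_hql() {
  echo "ALTER TABLE stats ADD IF NOT EXISTS PARTITION (year=$1, month=$2, day=$3) LOCATION '$(partition_dir "$1" "$2" "$3")'"
}

# Create the directory and register the Hive partition only when the
# directory does not already exist.
ensure_partition() {
  DIR="$(partition_dir "$1" "$2" "$3")"
  # `hdfs dfs -test -d` exits 0 if the directory already exists
  if ! hdfs dfs -test -d "$DIR"; then
    hdfs dfs -mkdir -p "$DIR"
    hive -e "$(add_partition_hql "$1" "$2" "$3")"
  fi
}
```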
