Mark,

I can't say that I've ever really given thought to an explicit "Eventing 
Model" like the one that you are describing.
However, as you describe it, this is really just a new relationship 
on the PutHDFS processor, so it would
be a very processor-specific change.

Rather than adding a new "Event" type of relationship, though, I would lean 
more toward adding an attribute to the existing
FlowFile that is routed to 'success'. So an attribute named, say, 
"hdfs.directory.created" could be added to the FlowFile,
and if you care about that information, you can route the FlowFiles to a 
RouteOnAttribute processor, which
is able to route each FlowFile accordingly.
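
As a sketch only (the "hdfs.directory.created" attribute does not exist today; 
it is the hypothetical attribute proposed above), the RouteOnAttribute side 
could be configured like this, using its standard dynamic-property mechanism:

```
# RouteOnAttribute configuration (hypothetical attribute):
Routing Strategy  : Route to Property name

# Dynamic property -- defines a "partition.created" relationship that
# matches only FlowFiles whose hdfs.directory.created attribute is "true":
partition.created : ${hdfs.directory.created:equals('true')}
```

FlowFiles routed to "partition.created" could then feed whatever processor 
runs your Hive statement, while everything else continues on unmatched.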

Does this give you what you need?

Thanks
-Mark



> On Nov 29, 2015, at 9:36 AM, Mark Petronic <[email protected]> wrote:
> 
> I know there is an experimental event-driven scheduling policy. This is not 
> that. Has anyone considered a pattern where processors might emit events 
> based on certain criteria and other processors might ONLY act on those 
> events? I'm just thinking out loud on this thought at the moment and just 
> wanted to see if anyone else had pondered this concept. Here's my use case. 
> Consider the RouteText processor feeding into a PutHDFS. RouteText is 
> grouping records on yyyymmdd values using a regex because I want to partition 
> files into HDFS directories by yyyymmdd and then use Hive to query the data. 
> PutHDFS simply uses the RouteText.group attribute to create the 
> year/month/day HDFS directory structure like:
> 
> /stats/year=2015/month=11/day=28/the_stats_file_000001.csv
> 
> However, I need to ALSO run a Hive HQL command to "alter table X add 
> partition Y" to allow Hive to see this new partition of data. So, the "event" 
> part of this concept would be some way to instruct PutHDFS to emit an event 
> ONLY when it actually creates a new directory. There could be an "event" 
> relationship that could feed some other processor, like ExecuteSQL, that 
> would then add the partition ONLY when this event occurs. It would NOT act on 
> any flowfiles - only on events. There will be lots of files being put that 
> will fall into an ALREADY existing directory since PutHDFS only has to create 
> the directory structure ONCE. I only need to know about that ONE event so as 
> to run the HQL command ONCE. I know there are ways to "wire" this up using 
> existing processors, like use ExecuteStreamCommand to run a script that 
> checks if the directory exists, and, if not, create it and run the SQL 
> processor to run a SQL command against Hive to build the partition and then 
> let PutHDFS do its thing. But that means running this script on EVERY flow 
> file, which is a waste of resources. Only PutHDFS really knows when it needs 
> to create the directory ONCE. I was just wondering if there was any thought 
> of building in some async event handling?
> 
> Anyway, just an idea.
> 
> Mark
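
(For reference, the one-time Hive statement described above -- "alter table X 
add partition Y" -- might look like the following. The table name "stats" is 
assumed; the partition values and location come from the example path in the 
message:)

```sql
-- Hypothetical: register the new partition with Hive once, when PutHDFS
-- first creates the directory. IF NOT EXISTS makes the statement safe
-- to run more than once.
ALTER TABLE stats ADD IF NOT EXISTS PARTITION (year=2015, month=11, day=28)
LOCATION '/stats/year=2015/month=11/day=28';
```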
