Hi,

I’m not a NiFi expert by any stretch of the imagination, and there are others on this list far better informed than me who can speak with authority on many of the questions you raise, but I’ll have a go…
It is probably not necessary to create a custom processor to do the parsing with Parsey McParseface (PMPF); your ExecuteScript processor is probably sufficient. The one reason this might not be desirable is if the Parsey model initialisation is expensive, so that repeating it for each script invocation would create a bottleneck in processing. If it isn’t, then GetKafka -> ExecuteScript (Parsey) -> PutKafka would, conceptually, do what you want, I would have thought.

What is missing from this pipeline, as you say, is the analysis of the Parsey output. This may be something a custom processor is suited to: a fairly simple one that uses standard Java text processing / regexp and writes the result to a new flowfile to put back on the Kafka queue.

If, however, you feel that running Parsey via an ExecuteScript processor isn’t suitable, then I guess there are a number of options available. To make it thread safe and available from each node in your NiFi cluster in a consistent way, I would be inclined to wrap Parsey up in an HTTP service and invoke it via REST (as an idea): post in the data to parse and receive the output. The service could even do the analysis and format the output appropriately (as JSON, perhaps) before returning it. You would invoke it via the InvokeHTTP processor. All of this may be possible in a custom processor too, and that is probably the best option IF you can work out how to handle the Parsey model initialisation within the custom processor.

In any case, my advice (for what it’s worth) would be to turn to custom processors as a last resort and to leverage the built-in processors where possible. Whilst it is (fairly) trivial, as you have found out, to write your own processor, it comes with its own maintenance overhead over time, whereas the built-in ones come with the reassurance that they are well tried and tested.

Sorry I can’t be more specific on your (very interesting) use case.
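As a rough illustration of the text-processing step described above, here is a minimal Python sketch that turns a Parsey tree into depth-tagged lines such as "[1]It PRP nsubj". It is only a sketch under assumptions: it assumes the tree uses SyntaxNet's usual layout, in which each level adds four columns of space, '|' and '+-- ' drawing characters, and the sample below is the excerpt from the original question with that indentation assumed. The names `annotate_depth`, `format_records` and `indent_width` are hypothetical, not part of NiFi or Parsey.

```python
import re

# Tree-drawing prefix: optional spaces and '|' characters, then '+-- '.
# The ROOT line has no prefix, so the whole group is optional.
TREE_PREFIX = re.compile(r"^(?:[ |]*\+-- )?")

# Sample tree; the exact column widths are an assumption (4 per level).
SAMPLE = """\
is VBZ ROOT
 +-- It PRP nsubj
 +-- to IN prep
 |   +-- scholars NNS pobj
 |       +-- two CD num
 |       +-- English JJ amod
 |       +-- , , punct
 |       +-- father NN conj
 |       |   +-- and CC cc
 |       |   +-- son NN conj
 |       +-- Pococke NNP appos
"""


def annotate_depth(parse_text, indent_width=4):
    """Return a list of (depth, word, tag, relation) tuples, one per tree line."""
    records = []
    for line in parse_text.splitlines():
        if not line.strip():
            continue
        # Number of columns before the first character of the word.
        prefix_len = TREE_PREFIX.match(line).end()
        fields = line[prefix_len:].split()
        if len(fields) != 3:
            # Skip lines that are not "word TAG relation", e.g. "Parse:".
            continue
        word, tag, relation = fields
        records.append((prefix_len // indent_width, word, tag, relation))
    return records


def format_records(records):
    """Render records in the "[depth]word TAG relation" form."""
    return ["[%d]%s %s %s" % record for record in records]


if __name__ == "__main__":
    for out_line in format_records(annotate_depth(SAMPLE)):
        print(out_line)
```

Keeping the depth as a separate field of each record, rather than only as a text prefix, leaves it available for later logic that decides whether a word becomes a vertex or an edge. The same approach could be ported to Java if it ends up inside a custom processor.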
Regards
Conrad

From: Pat Trainor <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, 6 June 2016 at 12:02
To: "[email protected]" <[email protected]>
Subject: Re: Nifi & Parsey McParseface! RegEx in a Processor...

Conrad,

Thanks for writing! You do get the gist of it.

Last night I realized how easy it is to make a custom processor. I was a little confused at first why I needed to pass on a new Flowfile in my simple onTrigger function, but the error in the Nifi GUI about versions/timestamp made it obvious. I guess I wasn't thinking and didn't check the nifi logs!

Anyway, if I am correct, I might be able to add an attribute to an existing Flowfile from my little processor. As of late last night I could change one that was there already, but today I will try to create one. If I can, then this should go well.

Unfortunately, and tell me if I am wrong, this new processor will still need to be loaded each time a sentence needs to be analyzed by 'Parsey'. On a small scale, this is no big deal, but normally people would be hammering it.

In looking for a clean, fast and [hopefully] elegant solution to accessing running services from a processor, is it bad design to simply make my parser run as a service, and have it listen to Kafka for text to parse? It could send it back as well via another topic...

But that is only 1/2 the problem. The other 1/2 is parsing out the output from Parsey, and maybe that is what I should make my processor for, rather than getting text sent to & returned from Parsey...

Because storing the output of Parsey (text) isn't a direct operation (see the sample output text in prev/original email), its output needs to be analyzed first. So let me know if this plan is viable:

1. Make the Parsey interaction via a java loop (daemon/service).
2. This daemon loads the chosen Parsey model once, then waits for Kafka messages to process, outputting each on another Kafka topic. It expects to receive 3 things:
   * Flowfile as text to parse.
   * The Kafka Topic to listen to (processor can't configure this, but will reflect user's choice).
   * The Kafka Topic to send it back on (this I can send to the java daemon, and configure each 'return' at runtime).
   * This way, I am imagining many processors can send to Parsey via one fixed topic, and they can each wait for the return data via a unique Topic for just that processor.
   * I cannot see a way to adjust the listening Topic at runtime, so the user would make one for all processors to use, then enter that as a processor attribute.
3. My simple processor sends a flowfile to it via the topic the user selects as a Processor attribute "Send Topic".
4. The parser, well, parses. Then it sends back the reply on a Topic that is also set in the processor, as the "Receive Topic".
   * Is it better to just do the Kafka transfer in the processor, or hand it off to PutKafka & GetKafka? My thinking is that this would be harder to do, and I would need to write 2 processors... Thoughts?
5. The custom processor I'm writing then has the parsed text, but not in a format that will allow it to be put into a [graph] database. Knowing a word is a NNP isn't enough; you must know which branch on the tree it was (how important it is).
   * This is where the [X] extraction counts, or a better mechanism that I'm not thinking of.
6. At this point, I am very tempted to keep going in this processor, but what if the user wants HDFS, Titan, ...? Best here is to stop & put the results in its own "relationship", with the original text that was parsed in another, and perhaps even the 'raw parsed' tree-looking text in another Relationship.
   * So 4 relationships:
     1. Submitted
     2. Post Parsey
     3. Indexed
     4. Failure (of any of 2 or 3)

I will make the (Indexed) output of this processor a standard, of sorts, which another processor can change into a query for the DB of choice. The 'tree level' could be used for logic like:

1. NNP/NNPS at [1] is a vertex.
2. NN/NNS > [2] are destination vertices of the above.
3. VBG at ROOT is an edge.
4. ...

Would it be OK to leave cobbling together their query to INSERT into their DB of choice to them? Once such a query is crafted, they can use any standard Nifi Put* processor, is my thinking...

Your feedback appreciated!

On Jun 6, 2016 3:18 AM, "Conrad Crampton" <[email protected]> wrote:

Hi,

This may be a long shot, as I don’t know how many combinations of the column lengths with | and + there are, but you could try using the ReplaceTextWithMapping processor, where you have all combinations of +--| etc. in a text file with what they represent in terms of counts, e.g. +-- [0] | +-- [1] | +-- [3] etc. (tab separated).

Also, I’m not particularly experienced in the area of sed, awk etc., but I’m guessing some bash guru would be able to come up with some sort of script that does this and that could be called from the ExecuteScript processor.

Regards
Conrad

From: Pat Trainor <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, 5 June 2016 at 18:33
To: "[email protected]" <[email protected]>
Subject: Nifi & Parsey McParseface! RegEx in a Processor...

I have had success with using the ReplaceText processor out of the box to modify the output of a nifi-called script. I'm applying nifi to running the parsey mcparseface system (Syntaxnet) from google. The output of the application looks like this:

---
Input: It is to two English scholars , father and son , Edward Pococke , senior and junior , that the world is indebted for the knowledge of one of the most charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
 +-- It PRP nsubj
 +-- to IN prep
 |   +-- scholars NNS pobj
 |       +-- two CD num
 |       +-- English JJ amod
 |       +-- , , punct
 |       +-- father NN conj
 |       |   +-- and CC cc
 |       |   +-- son NN conj
 |       +-- Pococke NNP appos
[...]
---

As you can see, my ExecuteStreamCommand processor is working fine.
But there is a bit of importance that needs to be taken from this text. My ReplaceText Processor used (the first one) is shown in the attached. It only removes characters.

How many 'spaces' over each of the '+' signs sits is important. Simply removing leading spaces, + and | characters moves the first word in each line to the first column, without telling you how many columns over the words started in the original input.

What is needed is a way to count the number of columns at the beginning of each line that precede the first alphanumeric. It doesn't matter whether the same processor can also clean things up the way my present efforts do:

Input: It is to two English scholars , father and son , Edward Pococke , senior and junior , that the world is indebted for the knowledge of one of the most charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
It PRP nsubj
to IN prep
[...]

I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or another mechanism I'm not aware of, to gather the column count, making it available for later processing/storage.

[0]is VBZ ROOT
[1]It PRP nsubj
[1]to IN prep
[2] ...

With the [X] being the # of columns over from the left that the alpha-numeric character was.

The reasoning for this is that the position signifies how 'important' that attribute is in the sentence. It looks like a tree, but the number (indentation) is the length of the branch the word is on.

Is there a clever way to accomplish most/all of this, either with () regex or named attributes, in Nifi?

Thanks!

pat<http://about.me/PatTrainor>
( ͡° ͜ʖ ͡°)

"A wise man can learn more from a foolish question than a fool can learn from a wise answer". ~ Bruce Lee.
