Re: Nifi & Parsey McParseface! RegEx in a Processor...

Conrad Crampton Mon, 06 Jun 2016 04:31:27 -0700

Hi,
I’m not a NiFi expert by any stretch of the imagination and there are others on 
this list far better informed than me that can speak with authority on many of 
the questions you raise, but I’ll have a go…


It is probably not necessary to create a custom processor to do the parsing 
(using PMPF) – your ExecuteScript processor is probably sufficient. The one 
reason this may not be desirable is if the Parsey model initialisation is 
expensive, since doing it for each script invocation would cause a bottleneck 
in processing. If it isn’t, then using GetKafka -> ExecuteScript (Parsey) -> 
PutKafka would do what you want, I would have thought (conceptually).
However, what you are missing from this pipeline is the analysis of the Parsey 
output, as you say. Now this may be something for which a custom processor 
would be suitable – quite a simple text processing one using standard Java 
text processing / regexp to then write to a new flowfile for putting back on 
the Kafka queue.

If however you feel that running Parsey via an ExecuteScript processor isn’t 
suitable then I guess there are a number of options available. To make it 
thread safe etc. and available from each node in your NiFi cluster in a 
consistent way, I would be inclined to wrap Parsey up in an HTTP service and 
invoke it via REST (as an idea) – posting in the data to parse and receiving 
the output. The service could even do the analysis to format the output 
appropriately (as JSON perhaps) before returning it. It would be invoked via 
an HTTP processor (InvokeHTTP rather than GetHTTP, as GetHTTP only fetches 
and can’t post the text). This may all be able to be done in a custom 
processor too, and that is probably the best option IF you can understand the 
Parsey model initialisation within the custom processor.
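
As a rough illustration of the HTTP-service idea, here is a minimal Python 
sketch using only the standard library. The parse function is a hypothetical 
placeholder (Parsey itself is not called), and the handler name, JSON shape 
and loopback address are assumptions made for the example, not anything taken 
from NiFi or SyntaxNet.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse(text):
    # Hypothetical placeholder for a call into an already-loaded Parsey model.
    return [{"word": w, "position": i} for i, w in enumerate(text.split())]


class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the posted text, parse it, and answer with JSON.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps({"input": text, "parse": parse(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    # Port 0 asks the OS for any free port; bound to loopback only.
    return HTTPServer(("127.0.0.1", port), ParseHandler)
```

Calling make_server().serve_forever() would keep the process (and whatever 
model it holds) alive between requests. HTTPServer handles one request at a 
time, which sidesteps the thread-safety question at the cost of throughput.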

In any case, my advice (for what it’s worth) would be to turn to custom 
processors as a last resort and try to leverage the built-in processors where 
possible. Whilst it is (fairly) trivial (as you have found out) to write your 
own processor, it comes with its own overhead over time in maintenance etc., 
whereas using the built-in ones comes with the reassurance that they are well 
tried and tested.

Sorry I can’t be more specific on your (very interesting) use case.

Regards
Conrad

From: Pat Trainor <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, 6 June 2016 at 12:02
To: "[email protected]" <[email protected]>
Subject: Re: Nifi & Parsey McParseface! RegEx in a Processor...


Conrad,

Thanks for writing! You do get the gist of it. Last night I realized how easy 
it is to make a custom processor. I was a little confused at first why I needed 
to pass on a new Flowfile in my simple onTrigger function, but the error in the 
Nifi GUI about versions/timestamp made it obvious. I guess I wasn't thinking 
and didn't check the nifi logs!

Anyway, if I am correct, I might be able to add an attribute to an existing 
Flowfile from my little processor. As of late last night I could change one 
that was there already, but today I will try to create one. If I can, then this 
should go well.

Unfortunately, and tell me if I am wrong, this new processor will still need to 
be loaded each time a sentence needs to be analyzed by 'Parsey'. On a small 
scale, this is no big deal, but normally people would be hammering it.

In looking for a clean, fast and [hopefully] elegant solution to accessing 
running services from a processor, is it bad design to simply make my parser 
run as a service, and have it listen to Kafka for text to parse? It could send 
it back as well via another topic...
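
A minimal Python sketch of that "load once, then listen" shape, under loud 
assumptions: queue.Queue objects stand in for Kafka topics (no Kafka client is 
used), load_model is a hypothetical placeholder for the Parsey model load, and 
the message fields "text" and "reply_to" are invented for illustration.

```python
import queue
import threading


def load_model():
    # Hypothetical stand-in for the expensive, one-time Parsey model load.
    return lambda text: "PARSED(" + text + ")"


def parser_daemon(requests, reply_topics, stop):
    """Load the model once, then answer requests until 'stop' is set."""
    parse = load_model()  # paid once, not once per sentence
    while not stop.is_set():
        try:
            message = requests.get(timeout=0.1)
        except queue.Empty:
            continue
        # Each message names the topic its sender is listening on.
        reply_topics[message["reply_to"]].put(parse(message["text"]))


def demo():
    requests = queue.Queue()  # the one fixed "send" topic
    replies = {"proc-a": queue.Queue(), "proc-b": queue.Queue()}
    stop = threading.Event()
    worker = threading.Thread(target=parser_daemon, args=(requests, replies, stop))
    worker.start()
    requests.put({"text": "son", "reply_to": "proc-b"})
    answer = replies["proc-b"].get(timeout=5)
    stop.set()
    worker.join()
    return answer
```

In demo() only the queue named in "reply_to" receives the answer, which is the 
per-processor return topic idea in miniature.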

But that is only 1/2 the problem. The other 1/2 is parsing out the output from 
Parsey, and maybe that is what I should make my processor for, not getting text 
sent to & returned from Parsey... Because storing the output of Parsey (text) 
isn't a direct operation (see the sample output text in prev/original email), 
its output needs to be analyzed first.

So let me know if this plan is viable:

  1.  Make the Parsey interaction via a java loop (daemon/service).
  2.  This daemon loads the Parsey model chosen once, then waits for Kafka 
messages to process, outputting each on another Kafka topic. It expects to 
receive 3 things:

     *   Flowfile as text to parse.
     *   The Kafka Topic to listen to (processor can't configure this, but will 
reflect user's choice).
     *   The Kafka Topic to send it back on (this I can send to the java 
daemon, and configure each 'return' at runtime)

        *   This way, I am imagining many processors can send to Parsey via one 
fixed topic, and they can each wait for the return data via a unique Topic for 
just that processor.
        *   I cannot see a way to adjust the listening Topic at runtime, so the 
user would make one for all processors to use, then enter that as a processor 
attribute.

  3.  My simple processor sends a flowfile to it via the topic the user selects 
as a Processor attribute "Send Topic".
  4.  The parser, well, parses. Then it sends back the reply on a Topic that is 
also set in the processor, as the "Receive Topic".

     *   Is it better to just do the Kafka transfer in the processor, or hand 
it off to PutKafka & GetKafka? My thinking is that this would be harder to do, 
and I would need to write 2 processors... Thoughts?

  5.  The custom processor I'm writing then has the parsed text, but not in a 
format that will allow it to be put into a [graph] database. Knowing a word is 
an NNP isn't enough; you must know which branch on the tree it was (how 
important it is).

     *   This is where the [X] extraction counts, or a better mechanism that 
I'm not thinking of.

  6.  At this point, I am very tempted to keep going in this processor, but 
what if the user wants HDFS, Titan, or something else? Best here is to stop & 
put the results in their own "relationship", with the original text that was 
parsed in another, and perhaps even the 'raw parsed' tree-looking text in 
another Relationship.

     *   So 4 relationships:

        *   Submitted
        *   Post Parsey
        *   Indexed
        *   Failure (of any of 2 or 3)



I will make the (Indexed) output of this processor a standard, of sorts, which 
another processor can change into a query for the DB of choice. The 'tree 
level' could be used for logic like:

  1.  NNP/NNPS at [1] is a vertex.
  2.  NN/NNS > [2] are destination vertices of the above.
  3.  VBG at ROOT is an edge.
  4.  ...
Would it be OK to leave cobbling together the query to INSERT into their DB 
of choice to them? Once such a query is crafted, they can use any standard Nifi 
Put* processor, is my thinking...
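
The three rules listed above could be sketched in Python roughly as follows. 
This assumes the indexed format is "[depth]word TAG label" on one line, and it 
reads "> [2]" literally as "deeper than level 2"; both are assumptions, as are 
the function and result names.

```python
import re

# Assumed indexed line shape: "[depth]word TAG label"
INDEXED = re.compile(r"^\[(\d+)\](\S+) (\S+) (\S+)$")


def classify(indexed_line):
    """Map one indexed line to a (kind, word) pair, or None if it does not parse."""
    m = INDEXED.match(indexed_line)
    if m is None:
        return None
    depth, word, tag, label = int(m.group(1)), m.group(2), m.group(3), m.group(4)
    if tag in ("NNP", "NNPS") and depth == 1:
        return ("vertex", word)
    if tag in ("NN", "NNS") and depth > 2:
        return ("destination_vertex", word)
    if tag == "VBG" and label == "ROOT":
        return ("edge", word)
    return ("unclassified", word)
```

Returning a neutral (kind, word) pair rather than a query string keeps the 
DB-specific part out of this step, in line with leaving the INSERT to a later 
processor.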

Your feedback appreciated!
On Jun 6, 2016 3:18 AM, "Conrad Crampton" 
<[email protected]<mailto:[email protected]>> wrote:
Hi,
This may be a long shot as I don’t know how many combinations of the column 
lengths with | and + there are, but you could try using the 
ReplaceTextWithMapping processor, where you have all combinations of +--| etc. 
in a text file with what they represent in terms of counts, e.g.
+--           [0]
|  +--       [1]
|      +--   [3]

etc. (tab separated)
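
On the number of combinations: assuming each tree level is exactly 4 columns 
wide (either "|   " or four spaces) and a word is introduced by "+-- ", there 
are 2^(d-1) possible prefixes at depth d. A small Python sketch that generates 
such a tab-separated table is below; mapping_lines is a hypothetical helper, 
the counts use the convention that "+--" is level 1, and whether 
ReplaceTextWithMapping copes with keys that contain leading whitespace has not 
been checked here.

```python
from itertools import product


def mapping_lines(max_depth):
    """Yield 'prefix<TAB>[depth]' for every tree prefix up to max_depth."""
    for depth in range(1, max_depth + 1):
        # Each ancestor column is either "|   " (branch continues) or blank.
        for columns in product(("|   ", "    "), repeat=depth - 1):
            yield "".join(columns) + "+-- " + "\t" + "[%d]" % depth
```

The root line has no prefix at all, so it would need handling separately.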

Also, I’m not particularly experienced in the area of sed, awk etc. but I’m 
guessing some bash guru would be able to come up with some sort of script that 
does this and could be called from an ExecuteScript processor.

Regards
Conrad

From: Pat Trainor <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Sunday, 5 June 2016 at 18:33
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Nifi & Parsey McParseface! RegEx in a Processor...

I have had success with using the ReplaceText processor out of the box to modify 
the output of a nifi-called script. I'm applying nifi to running the Parsey 
McParseface system (SyntaxNet) from Google. The output of the application looks 
like this:

---
Input: It is to two English scholars , father and son , Edward Pococke , senior 
and junior , that the world is indebted for the knowledge of one of the most 
charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
+-- It PRP nsubj
+-- to IN prep
|   +-- scholars NNS pobj
|       +-- two CD num
|       +-- English JJ amod
|       +-- , , punct
|       +-- father NN conj
|       |   +-- and CC cc
|       |   +-- son NN conj
|       +-- Pococke NNP appos
[...]
---

As you can see, my ExecuteProcessorStream is working fine. But there is a bit 
of importance that needs to be taken from this text. My ReplaceText Processor 
used (the first one) is shown in the attached. It only removes characters.

How many 'spaces' over each of the '+' signs sits is important. Simply removing 
leading spaces, + and | characters moves the first word in each line to the 
first column, without telling you how many columns over the words started in 
the original input.

What is needed is a way to count the number of columns at the beginning of each 
line that precede the first alphanumeric. It doesn't matter whether the same 
processor also cleans things up the way my present efforts do:

Input: It is to two English scholars , father and son , Edward Pococke , senior 
and junior , that the world is indebted for the knowledge of one of the most 
charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
It PRP nsubj
to IN prep
[...]

I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or 
another mechanism I'm not aware of, to gather the column count, making it 
available for later processing/storage.

[0]is VBZ ROOT
[1]It PRP nsubj
[1]to IN prep
[2] ...

With the [X] being the # of columns over from the left that the alpha-numeric 
character was.
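
One way this could be done with a regex, sketched in Python (the same pattern 
should carry over to Java or a script processor). It assumes, from the sample 
output only, that every tree level is 4 columns wide and that a word is 
introduced by "+-- ", so [X] is the column offset divided by 4. It measures 
the tree prefix rather than searching for the first alphanumeric, because a 
token can be punctuation (", , punct"). The names LINE and index_parse are 
hypothetical.

```python
import re

# Assumption: levels are 4 columns wide ("|   " or "    "), then "+-- ".
LINE = re.compile(r"^(?P<indent>[| ]*)(?P<marker>\+-- )?(?P<rest>\S.*)$")


def index_parse(output):
    """Return the tree lines after 'Parse:' as '[depth]word TAG label'."""
    indexed, in_tree = [], False
    for line in output.splitlines():
        if line.startswith("Input:"):
            in_tree = False  # a new sentence begins
            continue
        if line.strip() == "Parse:":
            in_tree = True
            continue
        m = LINE.match(line) if in_tree else None
        if m is None:
            continue  # blank line or text outside a tree
        columns = len(m.group("indent")) + (4 if m.group("marker") else 0)
        indexed.append("[%d]%s" % (columns // 4, m.group("rest").rstrip()))
    return indexed
```

A word that is itself "|" would be swallowed by the indent pattern, so this is 
a starting point rather than a complete parser.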

The reasoning for this is that the position signifies how 'important' that 
attribute is in the sentence. It looks like a tree, but the number (indentation) 
is the length of the branch the word is on.

Is there a clever way to accomplish most/all of this, either with () regex or 
named attributes, in Nifi?

Thanks!
pat<http://about.me/PatTrainor>
( ͡° ͜ʖ ͡°)

"A wise man can learn more from a foolish question than a fool can learn from a 
wise answer". ~ Bruce Lee.


