Re: Nifi & Parsey McParseface! RegEx in a Processor...

Pat Trainor Mon, 06 Jun 2016 04:03:07 -0700

Conrad,

Thanks for writing! You do get the gist of it. Last night I realized how easy
it is to make a custom processor. I was a little confused at first why I
needed to pass on a new Flowfile in my simple onTrigger function, but the
error in the Nifi GUI about versions/timestamp made it obvious. I guess I
wasn't thinking and didn't check the nifi logs!


Anyway, if I am correct, I might be able to add an attribute to an existing
Flowfile from my little processor. As of late last night I could _change_ one
that was there already, but today I will try to _create_ one. If I can, then
this should go well.  

Unfortunately, and tell me if I am wrong, this new processor will still need
to be loaded each time a sentence needs to be analyzed by 'Parsey'. On a small
scale, this is no big deal, but normally people would be hammering it.

In looking for a clean, fast and [hopefully] elegant solution to accessing
running services from a processor, is it bad design to simply make my parser
run as a service, and have it listen to Kafka for text to parse? It could send
it back as well via another topic...

But that is only half the problem. The other half is parsing the output from
Parsey, and maybe that is what my processor should be for, rather than getting
text sent to & returned from Parsey... Because Parsey's output (text) can't be
stored directly (see the sample output text in the prev/original email), its
output needs to be analyzed first.
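
A minimal sketch of that analysis step, in Python for brevity. It assumes each tree level in Parsey's output is drawn exactly 4 columns wide (as in the sample in the original email) and that the leading run contains only spaces, `|`, `+` and `-`. The names `TREE_PREFIX` and `annotate` are made up for illustration; this is not part of Nifi or SyntaxNet.

```python
import re

# Leading run of tree-drawing characters: spaces, '|', '+' and '-'.
TREE_PREFIX = re.compile(r"^[ |+\-]*")

def annotate(parse_lines, width=4):
    """Replace each line's tree prefix with a [depth] marker.

    Assumption: one tree level == `width` columns of prefix.
    """
    annotated = []
    for line in parse_lines:
        prefix = TREE_PREFIX.match(line).group(0)
        depth = len(prefix) // width
        annotated.append(f"[{depth}]{line[len(prefix):].strip()}")
    return annotated

sample = [
    "is VBZ ROOT",
    "+-- It PRP nsubj",
    "+-- to IN prep",
    "|   +-- scholars NNS pobj",
    "|       +-- two CD num",
    "|       |   +-- and CC cc",
]
for row in annotate(sample):
    print(row)
```

For the sample lines this prints `[0]is VBZ ROOT`, `[1]It PRP nsubj`, `[1]to IN prep`, `[2]scholars NNS pobj`, `[3]two CD num` and `[4]and CC cc`. A token that itself starts with `-` or `+` would be miscounted, so real input may need a stricter pattern.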

So let me know if this plan is viable:

  1. Make the Parsey interaction via a java loop (daemon/service).
  2. This daemon loads the Parsey model chosen _once_, then waits for Kafka 
messages to process, outputting each on another Kafka topic. It expects to 
receive 3 things:
    1. Flowfile as text to parse.
    2. The Kafka Topic to listen to (processor can't configure this, but will 
reflect user's choice).
    3. The Kafka Topic to send it back on (this I _can_ send to the java 
daemon, and configure each 'return' at runtime)
      1. This way, I am imagining many processors can send to Parsey via one 
fixed topic, and they can each wait for the return data via a unique Topic for 
just that processor.
      2. I cannot see a way to adjust the listening Topic at runtime, so the 
user would make one for all processors to use, then enter that as a processor 
attribute.
  3. My simple processor sends a flowfile to it via the topic the user selects 
as a Processor attribute "Send Topic".
  4. The parser, well, _parses_. Then it sends back the reply on a Topic that is 
also set in the processor, as the "Receive Topic".
    1. Is it better to just do the Kafka transfer in the processor, or hand it 
off to PutKafka & GetKafka? My thinking is that this would be harder to do, 
and I would need to write 2 processors... Thoughts?
  5. The custom processor I'm writing then has the parsed text, but not in a 
format that will allow it to be put into a [graph] database. Knowing a word is 
an NNP isn't enough; you must know which _branch_ of the tree it was on (how 
important it is).
    1. This is where the [X] extraction counts, or a better mechanism that I'm 
not thinking of.
  6. At this point, I am very tempted to keep going in this processor, but what 
if the user wants HDFS, Titan, or something else? Best here is to stop & put the 
results in their own "relationship", with the original text that was parsed in 
another, and perhaps even the 'raw parsed' tree-looking text in another 
Relationship.
    1. So 4 relationships:
      1. Submitted
      2. Post Parsey
      3. Indexed
      4. Failure (of any of 2 or 3)
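
The request/reply part of the plan (steps 1 to 4) can be sketched without Kafka at all. In the Python sketch below, `queue.Queue` objects stand in for the Kafka topics and `load_model` is a hypothetical placeholder for the expensive Parsey model load; neither is a real Kafka or SyntaxNet API. It only illustrates the shape of the idea: load once, listen on one fixed topic, reply on the topic named in each message.

```python
import queue
import threading

def load_model(name):
    """Hypothetical stand-in for the slow, one-time Parsey model load."""
    return lambda text: f"parsed({text})"

def parser_daemon(requests, reply_topics, model_name="parsey"):
    """Serve parse requests until a None sentinel arrives."""
    model = load_model(model_name)      # loaded once, before the loop
    while True:
        msg = requests.get()            # blocks until a message arrives
        if msg is None:                 # sentinel: shut the daemon down
            break
        # Each message names the topic its reply should go back on.
        reply_topics[msg["reply_to"]].put(model(msg["text"]))

# One shared request "topic", one reply "topic" per processor.
requests = queue.Queue()
reply_topics = {"proc-a": queue.Queue(), "proc-b": queue.Queue()}

worker = threading.Thread(target=parser_daemon, args=(requests, reply_topics))
worker.start()

requests.put({"text": "It is to two English scholars", "reply_to": "proc-a"})
requests.put({"text": "Edward Pococke", "reply_to": "proc-b"})
requests.put(None)
worker.join()

reply_a = reply_topics["proc-a"].get()
reply_b = reply_topics["proc-b"].get()
print(reply_a)
print(reply_b)
```

Each sender gets back only its own result because the reply topic travels with the request. With real Kafka topics, correlating replies with the originating flowfile and handling failures would still need to be designed.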

  

I will make the (Indexed) output of this processor a standard, of sorts, which
another processor can change into a query for the DB of choice. The 'tree
level' could be used for logic like:

  1. NNP/NNPS at [1] is a vertex.
  2. NN/NNS > [2] are destination vertices of the above. 
  3. VBG at ROOT is an edge.
  4. ...
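
A rough sketch of how rules like these could be applied to Parsey's output, in Python for brevity. It assumes each tree level is 4 columns wide, that every line is `word TAG relation`, and that "> [2]" means "deeper than level 2"; `parse_line` and `classify` are illustrative names, not an existing API.

```python
import re

# Leading run of tree-drawing characters: spaces, '|', '+' and '-'.
TREE_PREFIX = re.compile(r"^[ |+\-]*")

def parse_line(line, width=4):
    """Split one Parse line into (depth, word, tag, relation)."""
    prefix = TREE_PREFIX.match(line).group(0)
    depth = len(prefix) // width        # assumption: 4 columns per level
    word, tag, relation = line[len(prefix):].split()
    return depth, word, tag, relation

def classify(depth, tag, relation):
    """Apply the three example rules; return None when no rule matches."""
    if tag in ("NNP", "NNPS") and depth == 1:
        return "vertex"
    if tag in ("NN", "NNS") and depth > 2:   # assumption: strictly deeper than [2]
        return "destination_vertex"
    if tag == "VBG" and relation == "ROOT":
        return "edge"
    return None

lines = [
    "is VBZ ROOT",
    "+-- It PRP nsubj",
    "|       +-- father NN conj",
    "|       +-- Pococke NNP appos",
]
roles = [(word, classify(depth, tag, rel))
         for depth, word, tag, rel in map(parse_line, lines)]
print(roles)
```

On these four lines only `father` matches a rule (an NN at level 3, so a destination vertex); `Pococke` is an NNP but sits at level 3 rather than level 1, so it is left unclassified.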

Would it be OK to leave it to users to cobble together the query that INSERTs
into their DB of choice? Once such a query is crafted, they can use any standard
Nifi Put* processor, is my thinking...

  

Your feedback appreciated!

On Jun 6, 2016 3:18 AM, "Conrad Crampton" <[email protected]> wrote:

> Hi,
>
> This may be a long shot, as I don’t know how many combinations of the column
> lengths with | and + there are, but you could try using the
> ReplaceTextWithMapping processor, where you have all combinations of +--| etc.
> in a text file with what they represent in terms of counts, e.g.
>
> +--           [0]
>
> |  +--       [1]
>
> |      +--   [3]
>
> etc. (tab separated)
>
> Also, I’m not particularly experienced in the area of sed, awk etc., but I’m
> guessing some bash guru would be able to come up with some sort of script that
> does this, which could be called from the ExecuteScript processor.
>
> Regards
>
> Conrad

>

> **From:** Pat Trainor <[email protected]>  
> **Reply-To:** "[email protected]" <[email protected]>  
> **Date:** Sunday, 5 June 2016 at 18:33  
> **To:** "[email protected]" <[email protected]>  
> **Subject:** Nifi & Parsey McParseface! RegEx in a Processor...

> I have had success using the ReplaceText processor out of the box to modify
> the output of a nifi-called script. I'm applying Nifi to running the Parsey
> McParseface system (SyntaxNet) from Google. The output of the application looks
> like this:
>
> \---
>
> Input: It is to two English scholars , father and son , Edward Pococke ,
> senior and junior , that the world is indebted for the knowledge of one of the
> most charming productions Arabian philosophy can boast of .  
> Parse:  
> is VBZ ROOT  
> +-- It PRP nsubj  
> +-- to IN prep  
> |   +-- scholars NNS pobj  
> |       +-- two CD num  
> |       +-- English JJ amod  
> |       +-- , , punct  
> |       +-- father NN conj  
> |       |   +-- and CC cc  
> |       |   +-- son NN conj  
> |       +-- Pococke NNP appos
>
> [...]
>
> \---

>

> As you can see, my ExecuteProcessorStream is working fine. But there is a
> bit of important information that needs to be taken from this text. The first
> _ReplaceText_ processor I used is shown in the attached. It only removes
> characters.
>
> How many 'spaces' over each of the '+' signs sits is important. Simply removing
> leading spaces, + and | characters moves the first word in each line to the
> first column, without telling you how many columns over the words started in
> the original input.
>
> What is needed is a way to count the number of columns at the beginning of
> each line that precede the first alphanumeric. It doesn't matter whether the
> same processor can also clean things up the way my present efforts do:

>

> Input: It is to two English scholars , father and son , Edward Pococke ,
> senior and junior , that the world is indebted for the knowledge of one of the
> most charming productions Arabian philosophy can boast of .  
> Parse:  
> is VBZ ROOT  
> It PRP nsubj  
> to IN prep
>
> [...]

>

> I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or
> another mechanism I'm not aware of, to gather the column count, making it
> available for later processing/storage.
>
> [0]is VBZ ROOT
>
> [1]It PRP nsubj
>
> [1]to IN prep
>
> [2] ...
>
> With the [X] being the number of columns over from the left that the
> alphanumeric character was.
>
> The reasoning for this is that the position signifies how 'important' that
> attribute is in the sentence. It looks like a tree, but the number
> (indentation) is the length of the branch the word is on.
>
> Is there a clever way to accomplish most/all of this, either with () regex
> or named attributes, in Nifi?

>

> Thanks!
>
> [pat](http://about.me/PatTrainor)
>
> ( ͡° ͜ʖ ͡°)
>
> "A wise man can learn more from a foolish _question_ than a fool can learn
> from a wise _answer_". ~ Bruce Lee.

