Nifi & Parsey McParseface! RegEx in a Processor...

Pat Trainor Sun, 05 Jun 2016 10:33:52 -0700

I have had success using the ReplaceText processor out of the box to modify
the output of a script called from NiFi. I'm applying NiFi to running the Parsey
McParseface system (SyntaxNet) from Google. The output of the application looks
like this:


  

\---

Input: It is to two English scholars , father and son , Edward Pococke ,
senior and junior , that the world is indebted for the knowledge of one of the
most charming productions Arabian philosophy can boast of .  
Parse:  
is VBZ ROOT  
+-- It PRP nsubj  
+-- to IN prep  
|   +-- scholars NNS pobj  
|       +-- two CD num  
|       +-- English JJ amod  
|       +-- , , punct  
|       +-- father NN conj  
|       |   +-- and CC cc  
|       |   +-- son NN conj  
|       +-- Pococke NNP appos  

[...]

\---

  

As you can see, my ExecuteProcessorStream is working fine. But there is an
important piece of information that needs to be extracted from this text. The
_ReplaceText_ processor I used (the first one) is shown in the attachment. It
only removes characters.

  

How many columns each '+' sign is indented is important. Simply removing
leading spaces, + and | characters moves the first word in each line to the
first column, without telling you how many columns over the word started in
the original input.
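To make the problem concrete, here is a minimal Python sketch (outside NiFi) that counts the columns before the first letter or digit on each line. The names `LEADING` and `leading_columns` are made up for illustration, and the sample lines are copied from the parser output above.

```python
import re

# Assumption: "indentation" means every character that precedes the first
# ASCII letter or digit on the line.
LEADING = re.compile(r"^[^A-Za-z0-9]*")


def leading_columns(line: str) -> int:
    """Return the number of columns that precede the first alphanumeric."""
    return len(LEADING.match(line).group(0))


sample = [
    "is VBZ ROOT",
    "+-- It PRP nsubj",
    "|   +-- scholars NNS pobj",
    "|       +-- two CD num",
]
columns = [leading_columns(line) for line in sample]
print(columns)  # [0, 4, 8, 12]
```

One caveat with this definition: a punctuation token such as the line `|       +-- , , punct` is counted as 16 columns rather than 12, because the commas are not alphanumeric. Matching the tree-drawing characters explicitly would avoid that.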

  

What is needed is a way to count the number of columns at the beginning of
each line that precede the first alphanumeric character. It doesn't matter
whether the same processor also cleans the lines up the way my present effort
does:

  

Input: It is to two English scholars , father and son , Edward Pococke ,
senior and junior , that the world is indebted for the knowledge of one of the
most charming productions Arabian philosophy can boast of .  
Parse:  
is VBZ ROOT  
It PRP nsubj  
to IN prep

[...]

  

I am hoping to somehow use the expressions (a la ${line:blah...) in NiFi, or
another mechanism I'm not aware of, to gather the column count, making it
available for later processing/storage.
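As one possible shape for "available for later processing/storage", the sketch below turns each parse line into a small record and serializes the records as JSON. It is plain Python rather than anything NiFi-specific; the field names and the `parse_node` helper are hypothetical, and how such a script would be wired into a flow is left open.

```python
import json
import re

# Hypothetical field names. The pattern assumes "word TAG relation" after an
# optional prefix made of spaces, '|' characters and a '+-- ' marker.
NODE = re.compile(
    r"^(?P<prefix>[ |]*(?:\+-- )?)"
    r"(?P<word>\S+)\s+(?P<tag>\S+)\s+(?P<relation>\S+)\s*$"
)


def parse_node(line: str):
    """Return a dict for one parse-tree line, or None if it does not match."""
    m = NODE.match(line)
    if m is None:
        return None
    return {
        "columns": len(m.group("prefix")),  # width of the tree-drawing prefix
        "word": m.group("word"),
        "tag": m.group("tag"),
        "relation": m.group("relation"),
    }


lines = ["is VBZ ROOT  ", "+-- It PRP nsubj  ", "|   +-- scholars NNS pobj  "]
records = [parse_node(line) for line in lines]
print(json.dumps(records))
```

Lines such as `Parse:` do not match and come back as `None`. A very short `Input:` sentence of exactly three tokens would match, so in practice only the lines after `Parse:` should be passed in.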

  

[0]is VBZ ROOT

[1]It PRP nsubj

[1]to IN prep

[2] ...

  

With [X] being the number of columns in from the left at which the first
alphanumeric character appeared.
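The `[0]`, `[1]`, `[1]`, `[2]` values in the example look like tree depth rather than raw column counts, since each level in the parser output is indented by four columns. Under that assumption, a rough Python sketch of the transformation could look like this (`TREE_LINE`, `LEVEL_WIDTH` and `tag_line` are illustrative names, not part of NiFi or SyntaxNet):

```python
import re

# Assumptions: the tree-drawing prefix consists only of spaces and '|',
# followed by an optional '+-- ' marker, and each level is 4 columns wide.
TREE_LINE = re.compile(r"^(?P<indent>[ |]*)(?P<marker>\+-- )?(?P<rest>.*)$")
LEVEL_WIDTH = 4


def tag_line(line: str) -> str:
    """Replace the tree-drawing prefix with a '[depth]' tag."""
    m = TREE_LINE.match(line)
    columns = len(m.group("indent")) + len(m.group("marker") or "")
    depth = columns // LEVEL_WIDTH
    return f"[{depth}]{m.group('rest').rstrip()}"


for line in ["is VBZ ROOT  ", "+-- It PRP nsubj  ", "+-- to IN prep  ",
             "|   +-- scholars NNS pobj  "]:
    print(tag_line(line))
```

Because the prefix is matched by its own characters instead of "anything that is not alphanumeric", punctuation tokens such as `, , punct` keep their correct depth.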

  

The reasoning for this is that the position signifies how 'important' that
attribute is in the sentence. It looks like a tree, but the number
(indentation) is the length of the branch the word is on.

  

Is there a clever way to accomplish most/all of this, either with regex
capture groups () or named attributes, in NiFi?

  

Thanks!  
  

[pat](http://about.me/PatTrainor)  

( ͡° ͜ʖ ͡°)  

  

"A wise man can learn more from a foolish _question_ than a fool can learn
from a wise _answer_". ~ Bruce Lee.
