Re: Nifi & Parsey McParseface! RegEx in a Processor...

Conrad Crampton Mon, 06 Jun 2016 00:19:17 -0700

Hi,
This may be a long shot, as I don’t know how many combinations of the column 
lengths with | and + there are, but you could try the ReplaceTextWithMapping 
processor, where you list all combinations of +--| etc. in a text file along 
with what they represent in terms of counts, e.g.
+--           [0]
|  +--       [1]
|      +--   [3]


etc. (tab separated)
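
A minimal sketch of how such a mapping file could be generated rather than 
typed by hand. It assumes each tree level is drawn four characters wide, as 
"|   " or "    " for the ancestor levels and "+-- " for the last one, which is 
what the sample parse output in this thread looks like. The function names and 
the base parameter are hypothetical, and whether ReplaceTextWithMapping copes 
with keys that contain leading and trailing spaces is something to check, not 
something this sketch guarantees.

```python
from itertools import product

def tree_prefixes(max_depth):
    """Yield (prefix, depth) for every tree-drawing prefix up to max_depth."""
    for depth in range(1, max_depth + 1):
        # Every ancestor level is either "|   " (branch still open) or
        # "    " (branch finished); the final level is always "+-- ".
        for levels in product(("|   ", "    "), repeat=depth - 1):
            yield "".join(levels) + "+-- ", depth

def mapping_lines(max_depth, base=1):
    """Return tab-separated 'prefix<TAB>[count]' lines for a mapping file.

    base is the count given to the shallowest prefix ("+-- "); base=0
    reproduces the numbering used in the example above.
    """
    return ["%s\t[%d]" % (prefix, depth - 1 + base)
            for prefix, depth in tree_prefixes(max_depth)]

if __name__ == "__main__":
    for line in mapping_lines(3):
        print(line)
```

The number of combinations doubles with every level (2^D - 1 prefixes for a 
maximum depth of D), so the file stays small for shallow trees but grows 
quickly for deeply nested sentences.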

Also, I’m not particularly experienced in the area of sed, awk etc., but I’m 
guessing some bash guru would be able to come up with some sort of script that 
does this and that could be called from the ExecuteScript processor.
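
For what it is worth, here is a rough sketch of such a script in Python rather 
than bash. It replaces the tree-drawing prefix of each parse line with the [X] 
marker asked for in the original message, taking X to be the prefix length 
divided by four (an assumption based on the sample output, where each level is 
four characters wide). The names tag_depth, tag_parse_output and filter_stream 
are made up for illustration, and how the script gets wired into a flow is 
left open.

```python
import re

INDENT_WIDTH = 4  # assumption: each tree level occupies 4 characters

# Zero or more 4-character ancestor cells ("|   " or "    "),
# optionally followed by the branch marker "+-- ".
PREFIX = re.compile(r"^(?:[| ] {3})*(?:\+-- )?")

def tag_depth(line):
    """Replace the tree prefix of one parse line with a [depth] marker."""
    prefix = PREFIX.match(line).group(0)
    depth = len(prefix) // INDENT_WIDTH
    return "[%d]%s" % (depth, line[len(prefix):])

def tag_parse_output(lines):
    """Tag only the lines that follow a 'Parse:' header."""
    in_parse = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Parse:"):
            in_parse = True
            yield line
        elif line.startswith("Input:") or not line.strip():
            # A new sentence or a blank line ends the current parse tree.
            in_parse = False
            yield line
        elif in_parse:
            yield tag_depth(line)
        else:
            yield line

def filter_stream(src, dst):
    """Read parser output from src and write the tagged lines to dst."""
    for line in tag_parse_output(src):
        dst.write(line + "\n")
```

Calling filter_stream(sys.stdin, sys.stdout) (after import sys) turns this 
into a plain stdin-to-stdout filter. Matching whole four-character cells, 
instead of stripping every leading |, +, - and space, keeps punctuation tokens 
such as "-" or "--" from being swallowed as part of the prefix.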

Regards
Conrad

From: Pat Trainor <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, 5 June 2016 at 18:33
To: "[email protected]" <[email protected]>
Subject: Nifi & Parsey McParseface! RegEx in a Processor...

I have had success using the ReplaceText processor out of the box to modify 
the output of a NiFi-called script. I'm applying NiFi to running the Parsey 
McParseface system (SyntaxNet) from Google. The output of the application looks 
like this:

---
Input: It is to two English scholars , father and son , Edward Pococke , senior 
and junior , that the world is indebted for the knowledge of one of the most 
charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
+-- It PRP nsubj
+-- to IN prep
|   +-- scholars NNS pobj
|       +-- two CD num
|       +-- English JJ amod
|       +-- , , punct
|       +-- father NN conj
|       |   +-- and CC cc
|       |   +-- son NN conj
|       +-- Pococke NNP appos
[...]
---

As you can see, my ExecuteProcessorStream is working fine. But there is a bit 
of importance that needs to be taken from this text. My ReplaceText Processor 
used (the first one) is shown in the attached. It only removes characters.

How many 'spaces' over each of the '+' signs sits is important. Simply removing 
leading spaces, + and | characters moves the first word in each line to the 
first column, without telling you how many columns over the words started in 
the original input.

What is needed is a way to count the number of columns at the beginning of each 
line that precede the first alphanumeric. It doesn't matter whether the same 
processor also cleans things up the way my present efforts do:

Input: It is to two English scholars , father and son , Edward Pococke , senior 
and junior , that the world is indebted for the knowledge of one of the most 
charming productions Arabian philosophy can boast of .
Parse:
is VBZ ROOT
It PRP nsubj
to IN prep
[...]

I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or 
another mechanism I'm not aware of, to gather the column count, making it 
available for later processing/storage.

[0]is VBZ ROOT
[1]It PRP nsubj
[1]to IN prep
[2] ...

With the [X] being the # of columns over from the left that the alpha-numeric 
character was.

The reasoning for this is that the position signifies how 'important' that 
attribute is in the sentence. It looks like a tree, but the number (indentation) 
is the length of the branch the word is on.

Is there a clever way to accomplish most/all of this, either with () regex or 
named attributes, in Nifi?

Thanks!
pat<http://about.me/PatTrainor>
( ͡° ͜ʖ ͡°)

"A wise man can learn more from a foolish question than a fool can learn from a 
wise answer". ~ Bruce Lee.


