Re: ExtractText Processor

Conrad Crampton Thu, 25 Feb 2016 03:51:05 -0800

Hi,
I don’t think you can do what you want with the ExtractText processor.
The relevant line of the code is

if (matcher.find())

at line 320 of ExtractText.java (v0.4.1). I would have included more of it for 
context, but it got blocked by email filtering.

Because matcher.find() is called only once, inside an if, only the first match 
is picked up. To get every match, the call would have to sit in a 
while (matcher.find()) loop, with each match returned by a matcher.group() call.
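
As a rough illustration of the while (matcher.find()) approach, here is a 
minimal standalone sketch in plain Java. It is not NiFi code: the class and 
method names (AllMatches, findAll) are made up for this example, and the 
sentence and regex are the ones from John's message quoted below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllMatches {

    // Collects every non-overlapping match of the regex, in order of appearance.
    // (Hypothetical helper for illustration; not part of ExtractText.)
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each call to find() resumes searching after the previous match.
        while (matcher.find()) {
            matches.add(matcher.group()); // group() is the whole match (group 0)
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "my friend and I went for a long walk. "
                + "It was raining and it was very cold";
        // Brackets make leading and trailing spaces in each match visible.
        for (String match : findAll("(.{9}and.{9})+", text)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

With an if in place of the while, only the first of the two matches would be 
collected, which is the behaviour John is seeing.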

Unless someone else can suggest anything different, I would say you would have 
to write your own custom processor for this (or extend the ExtractText 
processor with another property for repeating groups and, if it is set, run a 
different code path that uses while (matcher.find())).

HTH,
Conrad


From: John Burns <[email protected]>
Reply-To: [email protected]
Date: Thursday, 25 February 2016 at 09:44
To: [email protected]
Subject: Re: ExtractText Processor

Hi,

Thank you for the reply. I am trying to solve something I thought would be 
fairly simple, but I am not having much success:

Consider the string "my friend and I went for a long walk. It was raining and 
it was very cold". Testing it against the single Java regex (.{9}and.{9})+ 
gives two matches: "y friend and I went f" and " raining and it was v".

In NiFi I wish to do something similar, i.e. capture all the matching strings 
for a given regex (similar to grep). When I run the above regex in NiFi I see 
only the first match but not the second.

Could you advise how I can access all matches for the regex? The use case here 
is to monitor websites for a specific word and extract (say) 10 characters 
either side of the matching word, for all matches on the site.
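
Outside NiFi, that use case can be sketched in plain Java roughly as follows. 
This is only an illustration under assumptions: the names KeywordContext and 
contexts are invented here, and nothing in it is ExtractText behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordContext {

    // Returns each occurrence of word together with up to `width` characters
    // on either side of it. (Hypothetical helper for illustration.)
    static List<String> contexts(String text, String word, int width) {
        // Pattern.quote makes the word literal; {0,width} (rather than {width})
        // still matches when the word is close to the start or end of the text.
        // DOTALL lets '.' cross line breaks, which page content usually has.
        Pattern pattern = Pattern.compile(
                ".{0," + width + "}" + Pattern.quote(word) + ".{0," + width + "}",
                Pattern.DOTALL);
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "my friend and I went for a long walk. "
                + "It was raining and it was very cold";
        for (String context : contexts(page, "and", 10)) {
            System.out.println("[" + context + "]");
        }
    }
}
```

Two caveats with this sketch: matches do not overlap, so a second occurrence 
that falls inside the trailing context of the first is not reported on its 
own, and the word is matched as a plain substring ("and" also matches inside 
"sand") unless word boundaries (\b) are added.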

Thanks again

John


On Mon, Feb 22, 2016 at 7:05 AM, Conrad Crampton <[email protected]> wrote:
Hi John,
If you add a property for your regexp, called matches for example, whose value 
has several capture groups in it, e.g.
matches (?:^(.+) (\d+)$)
then, if it matches the incoming flow file, you will end up after processing 
with 3 attributes:
matches
matches.1
matches.2

matches and matches.1 hold the same value (that of the first capture group). 
If you set ‘Include Capture Group 0’ to true you get an additional attribute, 
matches.0, that holds the whole match (as with group 0 of 
java.util.regex.Matcher).
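
To make the attribute naming concrete, here is a small plain-Java sketch that 
mimics the naming described above. It is not ExtractText's actual code: 
CaptureGroupAttributes, toAttributes and the sample input "widgets 42" are 
invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupAttributes {

    // Maps the capture groups of the FIRST match to attribute names of the form
    // name, name.1, name.2, ... and optionally name.0 for the whole match.
    // (Hypothetical helper; it only imitates the naming described in the mail.)
    static Map<String, String> toAttributes(String name, String regex,
                                            String content, boolean includeGroupZero) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        if (matcher.find()) { // a single find(): later matches are ignored
            if (includeGroupZero) {
                attributes.put(name + ".0", matcher.group(0));
            }
            // groupCount() counts capturing groups only, so (?:...) is skipped.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                String value = matcher.group(i);
                if (value == null) {
                    continue; // this group did not take part in the match
                }
                if (i == 1) {
                    attributes.put(name, value); // bare name mirrors group 1
                }
                attributes.put(name + "." + i, value);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attributes =
                toAttributes("matches", "(?:^(.+) (\\d+)$)", "widgets 42", true);
        attributes.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

For the input "widgets 42", group 1 is "widgets" and group 2 is "42", so 
matches and matches.1 both carry "widgets" and matches.2 carries "42".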

HTH,
Conrad

From: John Burns <[email protected]>
Reply-To: [email protected]
Date: Sunday, 21 February 2016 at 20:04
To: [email protected]
Subject: ExtractText Processor

Hi,

I'm using the ExtractText processor to monitor a website for specific content 
terms and log matches to a database. However, according to the documentation 
for ExtractText, "...If the Regular Expression matches more than once, only the 
first match will be used".

Do I understand this correctly as meaning that only the first regex match of a 
given term will be captured (as opposed to how grep works, for example)? I want 
to capture all occurrences of the match, not just the first.

Any help would be appreciated.

Many thanks

John


