Re: RegEx not catching all tags

Andy LoPresto Thu, 02 Jun 2016 20:45:50 -0700

Thanks Sven. Could I ask you to open a Jira [1] requesting a boolean option in 
the ExtractText processor properties that allows for global results?


[1] https://issues.apache.org/jira/browse/NIFI

Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Jun 1, 2016, at 3:23 AM, Sven Davison <[email protected]> wrote:
> 
> Thanks. I did some more reading, and NiFi's documentation says it only 
> returns the first match. HOWEVER... the JSON object returned already had an 
> element of tags!
> 
> $.entities.hashtags.*.text or... something like that. I got it working late 
> last night!
> 
> 
> 
> -Sven Davison
> (sent from my iPhone)
> 
> On May 31, 2016, at 10:47 PM, Andy LoPresto <[email protected]> wrote:
> 
>> Hi Sven,
>> 
>> Are you using an ExtractText processor [1] here? If so, you can extract 
>> multiple capture groups which will be stored in flowfile attributes such as 
>> “regexattr.1”, “regexattr.2”, etc. when assigned to the regular expression 
>> name “regexattr”.
>> 
>> Try the regular expression I’ve provided here [2] (explanation available on 
>> the site). This captures a literal ‘#’ followed by one or more “word” 
>> characters up to a word boundary, and does this “globally”, i.e. it does not 
>> stop searching after the first result. I didn’t check exhaustively whether 
>> hashtags can contain special characters like ‘-’, but that should be 
>> well-documented by Twitter.
>> 
>> /(#[\w]+\b)/g
>> 
>> [1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExtractText/index.html
>> [2] https://regex101.com/r/gV3mO5/1
>> 
>> 
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>>> On May 31, 2016, at 3:32 PM, Sven Davison <[email protected]> wrote:
>>> 
>>> 
>>> http://prntscr.com/basrzy
>>> 
>>> the above is a screenshot showing a hashtags var only containing the first 
>>> instance of a hashtag. i want to get a list of ALL hashtags from 
>>> twitter.text not just the first one. i'm fairly sure my RegEx is wrong... 
>>> here's what i have.
>>> 
>>> (#{1}[a-zA-Z0-9_]*)
>>> 
>>> i'm using https://regex101.com/ to simulate traffic 
>>> and tests.. but i can't get it to recognize more than the first instance of 
>>> the regex.
>>
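
For anyone reading this thread later, here is a minimal sketch of the difference between a first-match search and the "global" search discussed above, using the same pattern minus the /.../g delimiters. Python is used purely for illustration and is not what ExtractText runs; the sample tweet text and the name extract_hashtags are made up for this example. In Python the "global" behaviour comes from calling re.findall rather than from a flag.

```python
import re

# Pattern from the thread: a literal '#', then one or more word characters,
# ending at a word boundary. The /.../g wrapper is regex101 notation only.
HASHTAG_PATTERN = re.compile(r"(#[\w]+\b)")


def extract_hashtags(text):
    """Return every hashtag in text (hypothetical helper for illustration)."""
    # findall keeps scanning after each hit, which is the "global" behaviour.
    return HASHTAG_PATTERN.findall(text)


# Made-up sample text, not taken from the thread.
sample = "Loving #NiFi and #dataflow today, see #apache_nifi!"

# search() stops at the first hit, like the single result Sven was seeing.
first_only = HASHTAG_PATTERN.search(sample).group(1)

# findall() returns all hits in order of appearance.
hashtags = extract_hashtags(sample)

print(first_only)
print(hashtags)
```

The same pattern gives one result or all of them depending only on how it is applied, which is why the original expression looked correct but returned a single hashtag.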

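The approach Sven ended up with, reading the hashtags from the parsed tweet rather than from the text, can be sketched roughly as below. This is an assumption-laden illustration: the JSON is a trimmed-down, made-up tweet that keeps only the fields named by the path $.entities.hashtags.*.text, and the plain-Python lookup stands in for whatever JSONPath evaluation was used in the actual flow.

```python
import json

# Made-up, heavily trimmed tweet payload (assumption): only the fields that
# the path $.entities.hashtags.*.text touches are included.
tweet_json = """
{
  "text": "Loving #NiFi and #dataflow today",
  "entities": {
    "hashtags": [
      {"text": "NiFi"},
      {"text": "dataflow"}
    ]
  }
}
"""

tweet = json.loads(tweet_json)

# Equivalent of $.entities.hashtags.*.text: take the "text" field of every
# element in entities.hashtags, tolerating tweets that lack either key.
tags = [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]

print(tags)
```

Note that the values here carry no leading '#', unlike the regex results, so downstream steps may need to treat the two sources differently.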