Under nifi-commons/nifi-utils/ there is a package called
org.apache.nifi.util.search. That is where the good stuff is. Wild
man Tony Kurc is bringing the high-speed search heat there.

Thanks
Joe
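
For readers who have not met the algorithm, here is a minimal Python sketch of Aho-Corasick multi-term search. It is not the NiFi code in org.apache.nifi.util.search, and the names build_automaton and search are made up for this illustration. It only shows the idea: build one automaton from all the terms, then find every occurrence in a single pass over the input.

```python
from collections import deque


def build_automaton(terms):
    """Build trie transitions, failure links, and output sets for the terms."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    out = [set()]  # out[state] -> terms that end at this state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(term)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # derive their failure link from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Terms ending at the suffix state also end here.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return sorted (start_index, term) pairs for every match in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            hits.append((i - len(term) + 1, term))
    return sorted(hits)


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))
```

The text is scanned once regardless of how many terms are loaded, which is why this approach suits scanning content against a large dictionary.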

On Thu, Nov 12, 2015 at 9:09 PM, Matt Burgess <[email protected]> wrote:
> That is awesome to hear, I didn't realize ScanContent worked that way, very cool!
>
> Sent from my iPhone
>
>> On Nov 12, 2015, at 8:40 PM, Joe Witt <[email protected]> wrote:
>>
>> User Experience - everything we do needs to be about continually
>> improving the user experience.  So yes for sure if you've got ideas on
>> how to provide a more intuitive play - yes please.  You will find an
>> implementation of Aho-Corasick under the standard processors
>> (ScanContent) and the associated library under search tools.
>> Amazingly fast.
>>
>> Thanks!
>> Joe
>>
>>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <[email protected]> wrote:
>>> Not sure if it would prove useful but I've started messing around with the
>>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>>> sample data and getting a regex out. If the data is "regular", the user
>>> wouldn't need to know an expression language, they would just need a
>>> representative sample of their data.
>>>
>>> Depending on how crazy I want to get, I might do cross-fold validation
>>> (rated against the algorithm on the whole set) for the sample input to see
>>> if it's really "regular" or that guessing a regex is just too hard for the
>>> given data.
>>>
>>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>>> a valuable feature? The missing link is the translator from Finite State
>>> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
>>> research has been done and there is code available (under GPL) so on purpose
>>> I did not read the paper or look at the source.
>>>
>>> Sorry in advance if I've gone too far afield here, I've just felt the pains
>>> of users trying to get the right recognizers for their data fields.
>>>
>>> Cheers,
>>> Matt
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>>>
>>> We have to make this easier...
>>>
>>> Maybe we should give someone access to an inline expression editor and see
>>> the results.  Like in regexpal...
>>>
>>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]> 
>>>> wrote:
>>>>
>>>> Good call.  I added trim() to the matches command, and it seems to have
>>>> resolved the issue.  I was checking for sane lengths, but maybe there was a
>>>> \n or something in there.  Problem for another day.  Thanks.
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>>> <[email protected]> wrote:
>>>>>
>>>>> Make sure your attribute name and value do not have white space on
>>>>> either side. A space is a valid character and is often overlooked.
>>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>>> to the attribute values.
>>>>>
>>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Thanks.  I did use the matches syntax already and checked the attribute
>>>>>> values in each processor using Data Provenance, but I will try adding the
>>>>>> additional bulletin to see if something else surfaces.
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>>>> use the literal function. I would suggest trying
>>>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>
>>>>>>> Matches is an exact match and values are case sensitive.
>>>>>>>
>>>>>>> If you set the bulletin level on the LogAttribute processor to 'info',
>>>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>>>> the app log.
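
The exact-match and case-sensitivity points above, along with the stray-whitespace pitfall raised elsewhere in this thread, can be tried outside NiFi. The following is only a sketch: Python's re.fullmatch stands in for the whole-value matching described for matches, str.strip stands in for trim(), and is_known_encoding is a hypothetical helper name. It is not how NiFi evaluates the expression.

```python
import re

# Pattern copied from the expression suggested in the thread.
ENCODINGS = re.compile(r"utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1")


def is_known_encoding(value):
    """Return True only when the whole value is one of the listed names."""
    return ENCODINGS.fullmatch(value) is not None


# repr() makes hidden whitespace visible, much like inspecting the logged
# attribute value would.
for raw in ["utf-16be", "UTF-8", " utf-8", "utf-8\n"]:
    print(repr(raw), is_known_encoding(raw), is_known_encoding(raw.strip()))
```

A value that looks right but carries a leading space or trailing newline fails the whole-value match until it is stripped, and a differently cased value fails either way.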
>>>>>>>
>>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>>> common character set.  I have an attribute called 'encoding' that
>>>>>>>> stores the result of an encoding test.  When passing that value as the
>>>>>>>> source to the ConvertCharacterSet processor, it didn't match the
>>>>>>>> processor's expected values.  I added an UpdateAttribute processor that
>>>>>>>> is attempting to compare 'encoding' to known valid Java character sets.
>>>>>>>> That comparison is where I am having trouble.  In SQL it would be
>>>>>>>> "where encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le',
>>>>>>>> 'us-ascii', 'iso-8859-1')".
>>>>>>>>
>>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>>> function combined with 'contains'.
>>>>>>>>
>>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>>>
>>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>>> ConvertCharacterSet processor.
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Charlie,
>>>>>>>>>     I am not sure what your use case is here. 'Literal' is not a
>>>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>>>> language strategy to accomplish it. Did you create a FlowFile attribute
>>>>>>>>> named 'encoding'?
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Typos on my regex were just in the email, not the processor.  It
>>>>>>>>>> should have read ${encoding:match...
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>>> iso-8859-1'):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>>> iso-8859-1):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>>> get a match using:
>>>>>>>>>>>
>>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>>>
>>>>>>>>>>> Charlie
>>>