That is awesome to hear. I didn't realize ScanContent worked that way, very
cool!

> On Nov 12, 2015, at 8:40 PM, Joe Witt <[email protected]> wrote:
> 
> User Experience - everything we do needs to be about continually
> improving the user experience.  So yes for sure if you've got ideas on
> how to provide a more intuitive play - yes please.  You will find an
> implementation of Aho-Corasick under the standard processors
> (ScanContent) and the associated library under search tools.
> Amazingly fast.
> 
> Thanks!
> Joe
> 
>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <[email protected]> wrote:
>> Not sure if it would prove useful but I've started messing around with the
>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>> sample data and getting a regex out. If the data is "regular", the user
>> wouldn't need to know an expression language, they would just need a
>> representative sample of their data.
>> 
>> Depending on how crazy I want to get, I might do cross-fold validation
>> (rated against the algorithm on the whole set) for the sample input to see
>> if it's really "regular" or that guessing a regex is just too hard for the
>> given data.
>> 
>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>> a valuable feature? The missing link is the translator from Finite State
>> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
>> research has been done and there is code available (under GPL) so on purpose
>> I did not read the paper or look at the source.
>> 
>> Sorry in advance if I've gone too far afield here, I've just felt the pains
>> of users trying to get the right recognizers for their data fields.
>> 
>> Cheers,
>> Matt
>> 
>> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>> 
>> We have to make this easier...
>> 
>> Maybe we should give someone access to an inline expression editor and see
>> the results.  Like in regexpal...
>> 
>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]> wrote:
>>> 
>>> Good call.  I added trim() to the matches command, and it seems to have
>>> resolved the issue.  I was checking for sane lengths, but maybe there was a
>>> \n or something in there.  Problem for another day.  Thanks.
>>> 
>>> 
>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>> <[email protected]> wrote:
>>>> 
>>>> Make sure your attribute name and value do not have whitespace on
>>>> either side. A space is a valid character and is often overlooked:
>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>> to the attribute values.
>>>> 
>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]>
>>>> wrote:
>>>>> 
>>>>> Thanks.  I did use the matches syntax already and checked the attribute
>>>>> values in each processor using Data Provenance, but I will try adding the
>>>>> additional bulletin to see if something else surfaces.
>>>>> 
>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>>> use the literal function. I would suggest trying:
>>>>>> 
>>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>> 
>>>>>> Matches is an exact match and values are case sensitive.
>>>>>> 
>>>>>> If you set the bulletin level on the LogAttribute processor to 'info',
>>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>>> the app log.
>>>>>> 
>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]>
>>>>>> wrote:
>>>>>>> 
>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>> common character set.  I have an attribute called 'encoding' that
>>>>>>> stores the result of an encoding test.  When passing that value as the
>>>>>>> source to the ConvertCharacterSet processor, it didn't match the
>>>>>>> processor's expected values.  I added an UpdateAttribute processor that
>>>>>>> is attempting to compare 'encoding' to known valid Java character sets.
>>>>>>> That comparison is where I am having trouble.  In SQL it would be
>>>>>>> "where encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le',
>>>>>>> 'us-ascii', 'iso-8859-1')."
>>>>>>> 
>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>> function combined with 'contains'.
>>>>>>> 
>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>> 
>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>> ConvertCharacterSet processor.
>>>>>>> 
>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Charlie,
>>>>>>>>     I am not sure what your use case is here. 'Literal' is not a
>>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>>> language strategy to accomplish it. Did you create a FlowFile attribute
>>>>>>>> named 'encoding'?
>>>>>>>> 
>>>>>>>> Matt
>>>>>>>> 
>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Typos on my regex were just in the email, not the processor.  It
>>>>>>>>> should have read ${encoding:match...
>>>>>>>>> 
>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>> iso-8859-1'):contains(encoding)}
>>>>>>>>>> 
>>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>> iso-8859-1):contains(encoding)}
>>>>>>>>>> 
>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>> get a match using:
>>>>>>>>>> 
>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>> 
>>>>>>>>>> Charlie
>> 
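
For anyone finding this thread later, the behaviour discussed above (a
whole-value, case-sensitive match that fails on stray whitespace until the
value is trimmed) can be sketched outside NiFi. This is only an
illustration in Python, not NiFi code: re.fullmatch stands in for the
exact-match semantics of matches, str.strip stands in for trim() (the two
may differ in exactly which characters they remove), and the names
KNOWN_ENCODINGS and is_known_encoding are made up for the example.

```python
import re

# Encoding names copied from the thread; the tuple name is hypothetical.
KNOWN_ENCODINGS = ("utf-8", "utf-16", "utf-16be", "utf-16le", "us-ascii", "iso-8859-1")

# Build the same alternation used in the thread. fullmatch() requires the
# entire value to match, which mirrors the "exact match" behaviour described.
ENCODING_PATTERN = re.compile("|".join(re.escape(name) for name in KNOWN_ENCODINGS))


def is_known_encoding(value: str, trim: bool = True) -> bool:
    """Return True if value is exactly one of the known names.

    With trim=False, surrounding whitespace (including a trailing newline)
    makes the match fail, as happened with the untrimmed attribute.
    """
    candidate = value.strip() if trim else value
    return ENCODING_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Print each sample with its untrimmed and trimmed result.
    for raw in ("utf-8", "utf-8\n", " utf-16be", "UTF-8"):
        print(repr(raw), is_known_encoding(raw, trim=False), is_known_encoding(raw))
```

The comparison stays case sensitive on purpose, matching what was said
about matches; lowercasing the value first would be a separate decision.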
