User Experience - everything we do needs to be about continually
improving the user experience.  So yes for sure if you've got ideas on
how to provide a more intuitive play - yes please.  You will find an
implementation of Aho-Corasick under the standard processors
(ScanContent) and the associated library under search tools.
Amazingly fast.

Thanks!
Joe

On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <[email protected]> wrote:
> Not sure if it would prove useful but I've started messing around with the
> Aho-Corasick algorithm in the hopes of the user being able to paste in some
> sample data and getting a regex out. If the data is "regular", the user
> wouldn't need to know an expression language, they would just need a
> representative sample of their data.
>
> Depending on how crazy I want to get, I might do cross-fold validation
> (rated against the algorithm on the whole set) for the sample input to see
> if it's really "regular" or that guessing a regex is just too hard for the
> given data.
>
> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
> a valuable feature? The missing link is the translator from Finite State
> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
> research has been done and there is code available (under GPL) so on purpose
> I did not read the paper or look at the source.
>
> Sorry in advance if I've gone too far afield here, I've just felt the pains
> of users trying to get the right recognizers for their data fields.
>
> Cheers,
> Matt
>
> Sent from my iPhone
>
> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>
> We have to make this easier...
>
> Maybe we should give someone access to an inline expression editor and see
> the results.  Like in regexpal...
>
> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]> wrote:
>>
>> Good call.  I added trim() to the matches command, and it seems to have
>> resolved the issue.  I was checking for sane lengths, but maybe there was a
>> \n or something in there.  Problem for another day.  Thanks.
>>
>>
>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>> <[email protected]> wrote:
>>>
>>> Make sure your attribute name and value do not have white space on
>>> either side. A 'space' is a valid character and is often overlooked.
>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>> for the attribute values.
>>>
>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]>
>>> wrote:
>>>>
>>>> Thanks.  I did use the matches syntax already and checked the attribute
>>>> values in each processor using Data Provenance, but I will try adding the
>>>> additional bulletin to see if something else surfaces.
>>>>
>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>> <[email protected]> wrote:
>>>>>
>>>>> Try adding a logAttribute processor after your encoding test to see
>>>>> what values are actually getting assigned to the encoding attribute.
>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>> use the literal function. I would suggest trying
>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>
>>>>> Matches is an exact match and values are case sensitive.
>>>>>
>>>>> If you set the bulletin level on the logAttribute processor to 'info',
>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>> the app log.
>>>>>
>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> I am attempting to convert many files with various encodings to a
>>>>>> common character set.  I have an attribute called 'encoding' that
>>>>>> stores the result of an encoding test.  When passing that value as the
>>>>>> source to the ConvertCharacterSet processor, it didn't match the
>>>>>> processor's expected values.  I added an UpdateAttribute processor that
>>>>>> is attempting to compare 'encoding' to known valid Java character sets.
>>>>>> That comparison is where I am having trouble.  In SQL it would be
>>>>>> "where encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le',
>>>>>> 'us-ascii', 'iso-8859-1')."
>>>>>>
>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>> function combined with 'contains'.
>>>>>>
>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>
>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>> ConvertCharacterSet processor.
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Charlie,
>>>>>>>      I am not sure what your use case is here. 'Literal' is not a
>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>> language strategy to accomplish it. Did you create a FlowFile attribute
>>>>>>> named 'encoding'?
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Typos on my regex were just in the email, not the processor.  It
>>>>>>>> should have read ${encoding:match...
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> This expression does not parse without error:
>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>> iso-8859-1'):contains(encoding)}
>>>>>>>>>
>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>> iso-8859-1):contains(encoding)}
>>>>>>>>>
>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>> get a match using:
>>>>>>>>>
>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>
>>>>>>>>> Charlie
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
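
For anyone who hits the same problem: the thread's diagnosis (an exact,
case-sensitive match that fails on stray whitespace, fixed by trimming
first) can be reproduced outside NiFi. The following is only a rough sketch
in Python, not NiFi code. It assumes that Python's re.fullmatch is a fair
stand-in for the whole-value behavior described above for matches, and that
str.strip() is close enough to trim() for this purpose. The name is_allowed
is made up for illustration.

```python
import re

# The alternation suggested in the thread, compiled once.
ALLOWED = re.compile(r"utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1")


def is_allowed(value: str, trim: bool = False) -> bool:
    """Return True only if the whole value is one of the listed encodings.

    fullmatch requires the entire string to match (case-sensitively),
    so a trailing newline or space makes the check fail unless the
    value is trimmed first.
    """
    if trim:
        value = value.strip()  # rough analogue of trimming before matching
    return ALLOWED.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["utf-8", "utf-8\n", " utf-16be", "UTF-8"]:
        print(repr(candidate), is_allowed(candidate), is_allowed(candidate, trim=True))
```

In this sketch a value with a trailing newline is rejected until it is
trimmed, and an upper-case value is rejected either way, which is consistent
with what Charlie saw once trim() was added.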