That is awesome to hear, I didn't realize ScanContent worked that way, very cool!
Sent from my iPhone

> On Nov 12, 2015, at 8:40 PM, Joe Witt <[email protected]> wrote:
>
> User Experience - everything we do needs to be about continually
> improving the user experience. So yes, for sure, if you've got ideas on
> how to provide a more intuitive play - yes please. You will find an
> implementation of Aho-Corasick under the standard processors
> (ScanContent) and the associated library under search tools.
> Amazingly fast.
>
> Thanks!
> Joe
>
>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <[email protected]> wrote:
>> Not sure if it would prove useful, but I've started messing around with the
>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>> sample data and get a regex out. If the data is "regular", the user
>> wouldn't need to know an expression language; they would just need a
>> representative sample of their data.
>>
>> Depending on how crazy I want to get, I might do cross-fold validation
>> (rated against the algorithm on the whole set) for the sample input to see
>> if it's really "regular" or whether guessing a regex is just too hard for
>> the given data.
>>
>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>> a valuable feature? The missing link is the translator from finite state
>> machine (from Aho-Corasick) to the target model (regex or otherwise). The
>> research has been done and there is code available (under GPL), so on
>> purpose I did not read the paper or look at the source.
>>
>> Sorry in advance if I've gone too far afield here; I've just felt the pains
>> of users trying to get the right recognizers for their data fields.
>>
>> Cheers,
>> Matt
>>
>> Sent from my iPhone
>>
>> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>>
>> We have to make this easier...
>>
>> Maybe we should give someone access to an inline expression editor and see
>> the results. Like in regexpal...
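[Editor's sketch] Matt's "regex guesser" idea can be illustrated with a toy: given equal-length samples, infer a per-position pattern by widening each column to a character class. This is only a hypothetical stand-in, deliberately much simpler than what he describes (it skips the Aho-Corasick FSM and the FSM-to-regex translation entirely); the class and method names are invented for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexGuesser {
    // Naive guess for same-length samples: a position that is always a digit
    // becomes \d, always a letter becomes [A-Za-z], always the same literal
    // character is quoted verbatim, and anything else becomes a wildcard.
    static String guess(List<String> samples) {
        int len = samples.get(0).length();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            boolean allDigit = true, allLetter = true, allSame = true;
            char first = samples.get(0).charAt(i);
            for (String s : samples) {
                char c = s.charAt(i);
                if (!Character.isDigit(c))  allDigit = false;
                if (!Character.isLetter(c)) allLetter = false;
                if (c != first)             allSame = false;
            }
            if (allDigit)       sb.append("\\d");
            else if (allLetter) sb.append("[A-Za-z]");
            else if (allSame)   sb.append(Pattern.quote(String.valueOf(first)));
            else                sb.append(".");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A "regular" sample: dates. Unseen dates of the same shape match.
        String re = guess(List.of("2015-11-12", "2014-01-31"));
        System.out.println(re);
        System.out.println("1999-12-25".matches(re)); // true
    }
}
```

A real guesser would of course need variable-length handling and some validation step (as Matt suggests, e.g. cross-fold validation against held-out samples) before trusting the inferred expression.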
>>
>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]> wrote:
>>>
>>> Good call. I added trim() to the matches command, and it seems to have
>>> resolved the issue. I was checking for sane lengths, but maybe there was a
>>> \n or something in there. Problem for another day. Thanks.
>>>
>>>
>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>> <[email protected]> wrote:
>>>>
>>>> Make sure your attribute name and value do not have whitespace on
>>>> either side. A 'space' is a valid character and is often overlooked.
>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>> for the attribute values.
>>>>
>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]> wrote:
>>>>>
>>>>> Thanks. I did use the matches syntax already and checked the attribute
>>>>> values in each processor using Data Provenance, but I will try adding the
>>>>> additional bulletin to see if something else surfaces.
>>>>>
>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>>> use the literal function. I would suggest trying
>>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>
>>>>>> Matches is an exact match and values are case sensitive.
>>>>>>
>>>>>> If you set the bulletin level on the LogAttribute processor to 'info',
>>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>>> the app log.
>>>>>>
>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]> wrote:
>>>>>>>
>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>> common character set.
>>>>>>> I have an attribute called 'encoding' that stores the
>>>>>>> result of an encoding test. When passing that value as the source to
>>>>>>> the ConvertCharacterSet processor, it didn't match the processor's
>>>>>>> expected values. I added an UpdateAttribute processor that is
>>>>>>> attempting to compare 'encoding' to known valid Java character sets.
>>>>>>> That comparison is where I am having trouble. In SQL it would be
>>>>>>> "where encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le',
>>>>>>> 'us-ascii', 'iso-8859-1')".
>>>>>>>
>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>> function combined with 'contains':
>>>>>>>
>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>>
>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>> ConvertCharacterSet processor.
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Charlie,
>>>>>>>>     I am not sure what your use case is here. 'Literal' is not a
>>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>>> language strategy to accomplish it. Did you create a FlowFile
>>>>>>>> attribute named 'encoding'?
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Typos in my regex were just in the email, not the processor. It
>>>>>>>>> should have read ${encoding:match...
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>> iso-8859-1'):contains(encoding)}
>>>>>>>>>>
>>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>> iso-8859-1):contains(encoding)}
>>>>>>>>>>
>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>> get a match using:
>>>>>>>>>>
>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>>
>>>>>>>>>> Charlie
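[Editor's sketch] The trim() fix Charlie lands on is easy to reproduce outside NiFi: the EL matches() function, like Java's String.matches(), only succeeds when the regex matches the entire value, so an invisible trailing newline in the attribute defeats an otherwise correct pattern. A minimal sketch in plain Java (not NiFi EL):

```java
public class EncodingMatch {
    public static void main(String[] args) {
        // Alternation of the valid character-set names from the thread.
        String pattern = "utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1";

        String clean = "utf-8";
        String noisy = "utf-8\n"; // e.g. a stray newline from an upstream encoding test

        System.out.println(clean.matches(pattern));        // true: whole value matches
        System.out.println(noisy.matches(pattern));        // false: the \n is part of the value
        System.out.println(noisy.trim().matches(pattern)); // true: trim() strips the whitespace
    }
}
```

In EL terms, that is the difference between ${encoding:matches('...')} failing on a padded value and ${encoding:trim():matches('...')} succeeding, which is consistent with Matthew's advice to check for surrounding whitespace.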
