Not sure if it would prove useful, but I've started messing around with the Aho-Corasick algorithm in the hope that a user could paste in some sample data and get a regex out. If the data is "regular", the user wouldn't need to know an expression language; they would just need a representative sample of their data.
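To make the "paste samples, get a regex" idea concrete, here is a minimal sketch. It is an assumption-laden illustration, not the approach described in this message: it uses a plain prefix trie rather than Aho-Corasick (no failure links), and the helper names `build_trie` and `trie_to_regex` are hypothetical. It only reproduces the samples exactly; it does not generalize to unseen values.

```python
import re


def build_trie(samples):
    """Build a nested-dict prefix trie; the empty-string key marks end of a sample."""
    root = {}
    for sample in samples:
        node = root
        for ch in sample:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-sample marker
    return root


def trie_to_regex(node):
    """Turn a trie node into a regex fragment matching every suffix below it."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a sample ends at this node, everything below it is optional.
    return group + "?" if ends_here else group


samples = ["utf-8", "utf-16", "utf-16be", "utf-16le", "us-ascii", "iso-8859-1"]
pattern = re.compile(trie_to_regex(build_trie(samples)))
```

Shared prefixes such as "utf-" are factored out once, so the result is more compact than a flat alternation of the samples, but it accepts exactly the same set of strings.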
Depending on how crazy I want to get, I might run cross-validation on the sample input (rated against the algorithm run on the whole set) to see whether the data is really "regular" or whether guessing a regex is just too hard for it.

Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be a valuable feature? The missing link is the translator from the Finite State Machine (from Aho-Corasick) to the target model (regex or otherwise). The research has been done and there is code available (under GPL), so I have deliberately not read the paper or looked at the source.

Sorry in advance if I've gone too far afield here; I've just felt the pain of users trying to get the right recognizers for their data fields.

Cheers,
Matt

Sent from my iPhone

> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>
> We have to make this easier...
>
> Maybe we should give someone access to an inline expression editor and see
> the results. Like in regexpal...
>
>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]> wrote:
>> Good call. I added trim() to the matches command, and it seems to have
>> resolved the issue. I was checking for sane lengths, but maybe there was a
>> \n or something in there. Problem for another day. Thanks.
>>
>>
>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke <[email protected]>
>>> wrote:
>>> Make sure your attribute name and value do not have whitespace on either
>>> side. A 'space' is a valid character and is often overlooked. " encoding"
>>> does not equal "encoding" or "encoding ". The same applies to the
>>> attribute values.
>>>
>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]>
>>>> wrote:
>>>> Thanks. I did use the matches syntax already and checked the attribute
>>>> values in each processor using Data Provenance, but I will try adding the
>>>> additional bulletin to see if something else surfaces.
>>>>
>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>> <[email protected]> wrote:
>>>>> Try adding a logAttribute processor after your encoding test to see what
>>>>> values are actually getting assigned to the encoding attribute. Attributes
>>>>> are always stored as strings, so I don't think you need to use the
>>>>> literal function. I would suggest trying
>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>
>>>>> Matches is an exact match and values are case sensitive.
>>>>>
>>>>> If you set the bulletin level on the logAttribute processor to 'info',
>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>> the app log.
>>>>>
>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]>
>>>>>> wrote:
>>>>>> I am attempting to convert many files with various encodings to a common
>>>>>> character set. I have an attribute called 'encoding' that stores the
>>>>>> result of an encoding test. When passing that value as the source to
>>>>>> the ConvertCharacterSet processor, it didn't match the processor's
>>>>>> expected values. I added an UpdateAttribute processor that is
>>>>>> attempting to compare 'encoding' to known valid Java character sets.
>>>>>> That comparison is where I am having trouble. In SQL it would be "where
>>>>>> encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le', 'us-ascii',
>>>>>> 'iso-8859-1')."
>>>>>>
>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>> function combined with 'contains'.
>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>
>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>> ConvertCharacterSet processor.
>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>> <[email protected]> wrote:
>>>>>>> Charlie,
>>>>>>> I am not sure what your use case is here. 'Literal' is not a NiFi
>>>>>>> expression language function. If you can give me some detail on what
>>>>>>> you are trying to do, I can help you with the NiFi expression language
>>>>>>> strategy to accomplish it. Did you create a FlowFile attribute named
>>>>>>> 'encoding'?
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>>> wrote:
>>>>>>>> Typos on my regex were just in the email, not the processor. It
>>>>>>>> should have read ${encoding:match...
>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> This expression does not parse without error:
>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1'):contains(encoding)}
>>>>>>>>>
>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1):contains(encoding)}
>>>>>>>>>
>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>> get a match using:
>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>
>>>>>>>>> Charlie
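
Editor's sketch of the whitespace issue resolved in this thread: an exact, case-sensitive match against the alternation fails when the attribute value carries a stray space or newline, and trimming first fixes it. This is Python standing in for the NiFi Expression Language, with `re.fullmatch` assumed to be a fair analogue of `matches` and `str.strip` of `trim()`; the function name `is_supported` is hypothetical.

```python
import re

# Same alternation as suggested in the thread for ${encoding:matches(...)}.
VALID = re.compile(r"utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1")


def is_supported(encoding: str, trim: bool = True) -> bool:
    """Return True if the whole value is one of the listed character sets."""
    value = encoding.strip() if trim else encoding
    # fullmatch requires the entire string to match, like an exact match.
    return VALID.fullmatch(value) is not None
```

With these assumptions, `is_supported("utf-8\n", trim=False)` is false while `is_supported("utf-8\n")` is true, which is consistent with trim() resolving the problem described above. A value such as "UTF-8" still fails because the comparison is case sensitive.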
