Under nifi-commons/nifi-utils/ there is a package called org.apache.nifi.util.search. That is where the good stuff is. Wild man Tony Kurc bringing the high-speed search heat there.
Thanks
Joe

On Thu, Nov 12, 2015 at 9:09 PM, Matt Burgess <[email protected]> wrote:
> That is awesome to hear, I didn't realize ScanContent worked that way, very
> cool!
>
> Sent from my iPhone
>
>> On Nov 12, 2015, at 8:40 PM, Joe Witt <[email protected]> wrote:
>>
>> User Experience - everything we do needs to be about continually
>> improving the user experience. So yes for sure if you've got ideas on
>> how to provide a more intuitive play - yes please. You will find an
>> implementation of Aho-Corasick under the standard processors
>> (ScanContent) and the associated library under search tools.
>> Amazingly fast.
>>
>> Thanks!
>> Joe
>>
>>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <[email protected]> wrote:
>>> Not sure if it would prove useful but I've started messing around with the
>>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>>> sample data and getting a regex out. If the data is "regular", the user
>>> wouldn't need to know an expression language, they would just need a
>>> representative sample of their data.
>>>
>>> Depending on how crazy I want to get, I might do cross-fold validation
>>> (rated against the algorithm on the whole set) for the sample input to see
>>> if it's really "regular" or that guessing a regex is just too hard for the
>>> given data.
>>>
>>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>>> a valuable feature? The missing link is the translator from Finite State
>>> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
>>> research has been done and there is code available (under GPL) so on purpose
>>> I did not read the paper or look at the source.
>>>
>>> Sorry in advance if I've gone too far afield here, I've just felt the pains
>>> of users trying to get the right recognizers for their data fields.
>>>
>>> Cheers,
>>> Matt
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 12, 2015, at 7:54 PM, Joe Witt <[email protected]> wrote:
>>>
>>> We have to make this easier...
>>>
>>> Maybe we should give someone access to an inline expression editor and see
>>> the results. Like in regexpal...
>>>
>>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <[email protected]>
>>>> wrote:
>>>>
>>>> Good call. I added trim() to the matches command, and it seems to have
>>>> resolved the issue. I was checking for sane lengths, but maybe there was a
>>>> \n or something in there. Problem for another day. Thanks.
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>>> <[email protected]> wrote:
>>>>>
>>>>> Make sure your attribute name and value do not have white space on
>>>>> either side. A 'space' is a valid character and is often overlooked.
>>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>>> for the attribute values.
>>>>>
>>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Thanks. I did use the matches syntax already and checked the attribute
>>>>>> values in each processor using Data Provenance, but I will try adding the
>>>>>> additional bulletin to see if something else surfaces.
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>>>> use the literal function. I would suggest trying
>>>>>>> ${encoding: matches ('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>
>>>>>>> Matches is an exact match and values are case sensitive.
>>>>>>>
>>>>>>> If you set the bulletin level on the LogAttribute processor to 'info',
>>>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>>>> the app log.
>>>>>>>
>>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>>> common character set. I have an attribute called 'encoding' that
>>>>>>>> stores the result of an encoding test. When passing that value as the
>>>>>>>> source to the ConvertCharacterSet processor, it didn't match the
>>>>>>>> processor's expected values. I added an UpdateAttribute processor that
>>>>>>>> is attempting to compare 'encoding' to known valid Java character sets.
>>>>>>>> That comparison is where I am having trouble. In SQL it would be
>>>>>>>> "where encoding in ('utf-8', 'utf-16', 'utf-16be', 'utf-16le',
>>>>>>>> 'us-ascii', 'iso-8859-1')."
>>>>>>>>
>>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>>> function combined with 'contains'.
>>>>>>>>
>>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>>>
>>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>>> ConvertCharacterSet processor.
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Charlie,
>>>>>>>>> I am not sure what your use case is here. 'Literal' is not a
>>>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>>>> language strategy to accomplish it. Did you create a FlowFile attribute
>>>>>>>>> named 'encoding'?
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Typos on my regex were just in the email, not the processor. It
>>>>>>>>>> should have read ${encoding:match...
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1'):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>>> get a match using:
>>>>>>>>>>>
>>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>>>
>>>>>>>>>>> Charlie
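
The matching behaviour discussed in this thread (an exact, case-sensitive match that fails on stray whitespace and succeeds once the value is trimmed) can be reproduced outside NiFi. The following is only a rough Java sketch, assuming that java.util.regex's whole-string matches() behaves like the Expression Language matches function as described above. The class and method names are hypothetical and are not part of NiFi.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not NiFi code.
final class EncodingMatchSketch {

    // The same alternation used in the thread's matches(...) expression.
    private static final Pattern KNOWN_ENCODINGS =
            Pattern.compile("utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1");

    // Whole-string, case-sensitive match with no trimming.
    static boolean matchesExactly(String value) {
        return value != null && KNOWN_ENCODINGS.matcher(value).matches();
    }

    // Same check after stripping leading/trailing whitespace such as ' ' or '\n'.
    static boolean matchesAfterTrim(String value) {
        return value != null && matchesExactly(value.trim());
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("utf-8"));       // true
        System.out.println(matchesExactly("utf-16be"));    // true
        System.out.println(matchesExactly("utf-8\n"));     // false: trailing newline
        System.out.println(matchesExactly(" utf-8"));      // false: leading space
        System.out.println(matchesExactly("UTF-8"));       // false: case sensitive
        System.out.println(matchesAfterTrim("utf-8\n"));   // true once trimmed
    }
}
```

If the attribute value carries a trailing newline from the encoding test, the untrimmed check fails even though the value looks right in a viewer, which is consistent with trim() resolving the issue in the thread.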
