Would this improvement be worthwhile? I'd also like to apply it to the other regex search/replace processors, RouteOnContent for instance. I have a working POC in my environment. I just need to clean it up, create a ticket, and get my contribution access approved.
On Wed., Dec. 16, 2020, 2:06 p.m. Eric Secules, <[email protected]> wrote:

> Hello everyone,
>
> I was wondering if there could be an improvement to ExtractText so that
> the entire content of the flowfile is scanned for matches in chunks of
> MAX_BUFFER_SIZE which overlap by MAX_CAPTURE_GROUP_LENGTH. That way we can
> do pattern extraction over arbitrary size files while keeping memory
> consumption limited.
>
> Consider the use case where I am looking to extract a small pattern of
> maybe 100 bytes from files that could be 1MB or 500MB. Looking at the
> ExtractText source code, it always allocates a byte array of the maximum
> size, so it probably wouldn't be appropriate to set that parameter too
> high. It's essential to have the chunks overlap by the maximum length of
> the capture group because the match may straddle two chunks. For the same
> reason it's not advisable to split the flowfile into chunks of
> MAX_BUFFER_SIZE using existing processors.
>
> Thanks,
> Eric
>
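
To make the overlapping-chunk idea from the quoted message concrete, here is a rough sketch. It is written in Python rather than NiFi's Java, it is not the POC mentioned above, and it is not how ExtractText currently works. The function name `scan_in_chunks` and its parameters are hypothetical; the two size parameters only borrow the names used in the message. It assumes every match is non-empty and no longer than `max_capture_group_length`, and that the overlap is smaller than the buffer.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple


def scan_in_chunks(
    stream: BinaryIO,
    pattern: "re.Pattern[bytes]",
    max_buffer_size: int,
    max_capture_group_length: int,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) for each match in the stream.

    Hypothetical helper: at most max_buffer_size bytes are held at once.
    """
    if max_capture_group_length >= max_buffer_size:
        raise ValueError("overlap must be smaller than the buffer size")

    tail = b""        # bytes carried over from the previous window
    window_start = 0  # absolute stream offset of the first byte in the window

    while True:
        # Top the window up to max_buffer_size; an empty read means EOF.
        chunk = stream.read(max_buffer_size - len(tail))
        at_eof = not chunk
        window = tail + chunk

        # Before EOF, a match starting in the last `overlap` bytes could be
        # cut short by the window edge, so it is deferred to the next window.
        if at_eof:
            cutoff = len(window)
        else:
            cutoff = max(0, len(window) - max_capture_group_length)

        keep_from = cutoff
        for m in pattern.finditer(window):
            if m.start() >= cutoff:
                break
            yield window_start + m.start(), m.group(0)
            # Never rescan bytes that belong to an already reported match.
            keep_from = max(keep_from, m.end())

        if at_eof:
            return
        tail = window[keep_from:]
        window_start += keep_from


if __name__ == "__main__":
    data = b"x" * 10 + b"ID=12345" + b"y" * 10 + b"ID=777" + b"z" * 3
    pattern = re.compile(rb"ID=\d+")
    for offset, value in scan_in_chunks(io.BytesIO(data), pattern, 16, 10):
        print(offset, value)
```

Deferring matches that begin in the final overlap region, instead of reporting them and de-duplicating later, is what keeps a match that straddles two chunks from being reported truncated or twice. A pattern that can match more than `max_capture_group_length` bytes would still be at risk of truncation under this scheme.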
