Hi Eric,

I do think this would be interesting. Please submit a PR when you feel this is ready for a review.

Thanks,
Pierre

On Mon, Dec 21, 2020 at 21:56, Eric Secules <[email protected]> wrote:

> Would this improvement be worthwhile? I'd also like to apply it to all the
> regex search/replacement processors, route on content for instance. I have
> a working POC in my environment. I just need to clean things up, create a
> ticket and get my contribution access approved.
>
> On Wed., Dec. 16, 2020, 2:06 p.m. Eric Secules, <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I was wondering if there could be an improvement to ExtractText so that
>> the entire content of the flowfile is scanned for matches in chunks of
>> MAX_BUFFER_SIZE which overlap by MAX_CAPTURE_GROUP_LENGTH. That way we can
>> do pattern extraction over arbitrary size files while keeping memory
>> consumption limited.
>>
>> Consider the use case where I am looking to extract a small pattern of
>> maybe 100 bytes from files that could be 1MB or 500MB. Looking at the
>> ExtractText source code, it always allocates a byte array of the maximum
>> size, so it probably wouldn't be appropriate to set that parameter too
>> high. It's essential to have the chunks overlap by the maximum length of
>> the capture group because the match may straddle two chunks. For the same
>> reason it's not advisable to split the flowfile into chunks of
>> MAX_BUFFER_SIZE using existing processors.
>>
>> Thanks,
>> Eric
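
For anyone following the thread, the chunk-and-overlap scan described in the original message could be sketched roughly as below. This is only an illustration in Python, not the ExtractText code (which is Java) and not Eric's POC. The function name `scan_in_chunks` and its parameters are hypothetical; `max_buffer_size` and `max_capture_length` stand in for MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH. The sketch assumes that no match is longer than the overlap and that `read()` returns fewer bytes than requested only at end of stream.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple


def scan_in_chunks(
    stream: BinaryIO,
    pattern: bytes,
    max_buffer_size: int,
    max_capture_length: int,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) for each match in the stream.

    Hypothetical helper: at most max_buffer_size bytes are held at a time,
    and consecutive windows overlap by max_capture_length bytes.
    """
    if not 0 < max_capture_length < max_buffer_size:
        raise ValueError("overlap must be positive and smaller than the buffer")

    regex = re.compile(pattern)
    step = max_buffer_size - max_capture_length  # new bytes read per window
    window = stream.read(max_buffer_size)
    base = 0    # absolute offset of window[0]
    resume = 0  # absolute offset just past the last reported match

    while True:
        # Assumption: a short window means the stream is exhausted.
        at_eof = len(window) < max_buffer_size
        # Matches starting in the trailing overlap may be cut off, so they
        # are deferred to the next window unless this is the last one.
        cutoff = len(window) if at_eof else step

        for m in regex.finditer(window, max(resume - base, 0)):
            if m.start() >= cutoff:
                break
            yield base + m.start(), m.group(0)
            resume = base + m.end()

        if at_eof:
            break

        # Keep only the overlap, then top the window up with fresh bytes.
        window = window[cutoff:] + stream.read(step)
        base += cutoff
```

A small usage example, with a deliberately tiny buffer so that the first match straddles a chunk boundary:

```python
data = b"x" * 10 + b"ID=12345" + b"y" * 10 + b"ID=678"
print(list(scan_in_chunks(io.BytesIO(data), rb"ID=\d+", 12, 8)))
# [(10, b'ID=12345'), (28, b'ID=678')]
```

Deferring matches that begin inside the trailing overlap is one way to avoid reporting a truncated or duplicate match; other designs, such as deduplicating by offset, would also work.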
