Hello everyone,

I was wondering whether ExtractText could be improved so that the entire content of the flowfile is scanned for matches in chunks of MAX_BUFFER_SIZE that overlap by MAX_CAPTURE_GROUP_LENGTH. That way we could do pattern extraction over arbitrarily large files while keeping memory consumption bounded.
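To make the idea concrete, here is a rough sketch of the overlapping-chunk scan in Python. It is only an illustration of the technique, not how ExtractText is implemented; the function name scan_in_chunks and its parameters are made up for this example, and it assumes that no match is longer than max_capture_group_length and that stream.read(n) returns fewer than n bytes only at end of input.

```python
import io
import re


def scan_in_chunks(stream, pattern, max_buffer_size, max_capture_group_length):
    """Yield (absolute_offset, matched_bytes) for every match in a binary stream.

    Hypothetical helper, not part of NiFi. Assumes no match is longer than
    max_capture_group_length and that read(n) is short only at end of input.
    """
    overlap = max_capture_group_length
    if not 0 < overlap < max_buffer_size:
        raise ValueError("overlap must be positive and smaller than the buffer")

    base = 0        # absolute offset of buf[0]
    resume_at = 0   # absolute offset where the last reported match ended
    buf = stream.read(max_buffer_size)
    while True:
        nxt = stream.read(max_buffer_size - overlap)
        at_eof = not nxt
        # Matches starting in the trailing overlap may be cut off by the
        # window edge, so they are deferred to the next window (except at EOF).
        cutoff = len(buf) if at_eof else max(0, len(buf) - overlap)
        for m in pattern.finditer(buf, max(0, resume_at - base)):
            if m.start() >= cutoff:
                break
            yield base + m.start(), m.group(0)
            resume_at = base + m.end()
        if at_eof:
            return
        # Slide the window: keep the trailing overlap, append the new bytes.
        base += cutoff
        buf = buf[cutoff:] + nxt


# Usage: the first match straddles the first 16-byte window boundary.
data = b"a" * 13 + b"ID=12345" + b"b" * 20 + b"ID=678" + b"c" * 5
for offset, value in scan_in_chunks(io.BytesIO(data), re.compile(rb"ID=\d+"), 16, 8):
    print(offset, value)
```

The window never grows beyond max_buffer_size bytes, plus the freshly read block while the two are being joined, so memory use does not depend on the size of the input. Deferring matches that start in the trailing overlap is what prevents both truncated and duplicate results.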
Consider the use case where I want to extract a small pattern of maybe 100 bytes from files that could be anywhere from 1MB to 500MB. Looking at the ExtractText source code, it always allocates a byte array of the maximum buffer size, so it probably wouldn't be appropriate to set that parameter very high.

It's essential that the chunks overlap by the maximum length of the capture group, because a match may straddle two chunks. For the same reason, it's not advisable to split the flowfile into non-overlapping chunks of MAX_BUFFER_SIZE using existing processors: a match that falls across a split boundary would be missed.

Thanks,
Eric
