Hello everyone,

I was wondering whether ExtractText could be improved so that the entire
content of the flowfile is scanned for matches in chunks of
MAX_BUFFER_SIZE that overlap by MAX_CAPTURE_GROUP_LENGTH. That way we
could do pattern extraction over files of arbitrary size while keeping
memory consumption bounded.

Consider the use case where I am looking to extract a small pattern of
maybe 100 bytes from files that could be anywhere from 1MB to 500MB.
Looking at the ExtractText source code, it always allocates a byte array
of the maximum size, so setting that parameter high enough to cover the
largest file probably isn't appropriate. The chunks need to overlap by the
maximum length of the capture group because a match may straddle two
chunks, and without the overlap such a match would be missed or cut short.
For the same reason it's not advisable to split the flowfile into chunks
of MAX_BUFFER_SIZE using existing processors.
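
To make the idea concrete, here is a rough sketch of the overlapping-chunk
scan in plain Java. It is not based on the actual ExtractText code; the
class name ChunkedScanner and the parameters chunkSize and overlap are
hypothetical stand-ins for MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH.
It assumes no match is longer than overlap, and it decodes bytes as
ISO-8859-1 so that one byte is one char (a multi-byte charset would need
extra care at chunk boundaries).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NiFi: scans a stream in overlapping windows.
final class ChunkedScanner {

    // Assumption: every match is at most `overlap` bytes long.
    static List<String> findAll(InputStream in, Pattern pattern,
                                int chunkSize, int overlap) throws IOException {
        if (chunkSize <= 0 || overlap < 0) {
            throw new IllegalArgumentException("chunkSize must be > 0, overlap >= 0");
        }
        List<String> matches = new ArrayList<>();
        byte[] window = new byte[overlap + chunkSize]; // the only large allocation
        int carried = 0;      // bytes kept from the end of the previous window
        long windowBase = 0;  // absolute offset of window[0] in the stream
        long resumeAt = 0;    // absolute offset just past the last accepted match
        boolean eof = false;

        while (!eof) {
            // Top the window up after the carried-over bytes.
            int filled = carried;
            while (filled < window.length) {
                int n = in.read(window, filled, window.length - filled);
                if (n < 0) { eof = true; break; }
                filled += n;
            }

            String text = new String(window, 0, filled, StandardCharsets.ISO_8859_1);
            Matcher m = pattern.matcher(text);
            // Do not search inside a match that was already reported.
            int from = (int) Math.max(0, resumeAt - windowBase);
            m.region(from, filled);
            while (m.find()) {
                // A match starting in the last `overlap` bytes may be cut short
                // by the window end, so leave it for the next window.
                if (!eof && m.start() >= filled - overlap) {
                    break;
                }
                matches.add(m.group());
                resumeAt = windowBase + m.end();
            }

            if (!eof) {
                // Keep the tail so a straddling match is seen whole next time.
                carried = Math.min(overlap, filled);
                System.arraycopy(window, filled - carried, window, 0, carried);
                windowBase += filled - carried;
            }
        }
        return matches;
    }
}
```

A call such as ChunkedScanner.findAll(stream, Pattern.compile("ID=\\d+"),
8, 8) would keep at most 16 bytes in memory at a time. Each start offset
is accepted in exactly one window, which is what avoids both duplicate and
truncated matches in the overlap region. Patterns that rely on lookbehind
or on matches longer than the overlap are outside what this sketch handles.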

Thanks,
Eric
