Would this improvement be worthwhile? I'd also like to apply it to all the
regex search/replace processors, RouteOnContent for instance. I have
a working POC in my environment; I just need to clean things up, create a
ticket, and get my contribution access approved.

On Wed., Dec. 16, 2020, 2:06 p.m. Eric Secules, <[email protected]> wrote:

> Hello everyone,
>
> I was wondering if there could be an improvement to ExtractText so that
> the entire content of the flowfile is scanned for matches in chunks of
> MAX_BUFFER_SIZE that overlap by MAX_CAPTURE_GROUP_LENGTH. That way we can
> do pattern extraction over arbitrarily large files while keeping memory
> consumption bounded.
>
> Consider the use case where I am looking to extract a small pattern of
> maybe 100 bytes from files that could be 1MB or 500MB. Looking at the
> ExtractText source code, it always allocates a byte array of the maximum
> size, so it probably wouldn't be appropriate to set that parameter too
> high. It's essential to have the chunks overlap by the maximum length of
> the capture group because the match may straddle two chunks. For the same
> reason it's not advisable to split the flowfile into chunks of
> MAX_BUFFER_SIZE using existing processors.
>
> Thanks,
> Eric
>
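
The chunked scan described in the quoted message can be sketched roughly as
follows. This is a standalone illustration, not ExtractText's code and not the
POC mentioned above: the class ChunkedRegexScan and the method findAll are
hypothetical, and the parameters maxBufferSize and maxCaptureLength stand in
for MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH. It assumes no match is
longer than maxCaptureLength, and it decodes bytes as ISO-8859-1 so that one
char corresponds to one byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of NiFi.
class ChunkedRegexScan {

    // Scans the stream with a buffer of maxBufferSize bytes. Consecutive
    // buffers overlap by up to maxCaptureLength bytes, so a match that
    // straddles a chunk boundary is seen whole in the next buffer.
    // Assumption: no match is longer than maxCaptureLength.
    static List<String> findAll(InputStream in, Pattern pattern,
                                int maxBufferSize, int maxCaptureLength) {
        if (maxCaptureLength <= 0 || maxCaptureLength >= maxBufferSize) {
            throw new IllegalArgumentException(
                    "need 0 < maxCaptureLength < maxBufferSize");
        }
        List<String> matches = new ArrayList<>();
        byte[] buffer = new byte[maxBufferSize];
        int carried = 0;      // bytes kept from the end of the previous buffer
        boolean eof = false;
        while (!eof) {
            // Top up the buffer behind the carried-over bytes.
            int filled = carried;
            while (filled < buffer.length) {
                int n;
                try {
                    n = in.read(buffer, filled, buffer.length - filled);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                if (n < 0) {
                    eof = true;
                    break;
                }
                filled += n;
            }
            // ISO-8859-1 maps each byte to exactly one char.
            String text = new String(buffer, 0, filled, StandardCharsets.ISO_8859_1);
            // A match starting at or after the cutoff may be cut short by the
            // end of the buffer, so it is deferred to the next buffer.
            int cutoff = eof ? filled : filled - maxCaptureLength;
            int resume = cutoff;
            Matcher m = pattern.matcher(text);
            while (m.find() && m.start() < cutoff) {
                matches.add(m.group());
                // Never rescan bytes that belong to a reported match.
                resume = Math.max(cutoff, m.end());
            }
            // Carry the unscanned tail over to the front of the buffer.
            carried = filled - resume;
            System.arraycopy(buffer, resume, buffer, 0, carried);
        }
        return matches;
    }
}
```

Memory use stays at one buffer of maxBufferSize bytes plus its decoded copy,
regardless of the file size. Deferring matches that start in the final
maxCaptureLength bytes of a buffer is what prevents both truncated and
duplicate results. Patterns that rely on lookbehind or on anchors such as ^
and $ may behave differently at chunk boundaries, which this sketch does not
handle.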
