Re: ExtractText Improvement

Pierre Villard Tue, 22 Dec 2020 02:35:53 -0800

Hi Eric,

I do think this would be interesting. Please submit a PR when you feel it
is ready for review.


Thanks,
Pierre

On Mon, Dec 21, 2020 at 21:56, Eric Secules <[email protected]> wrote:

> Would this improvement be worthwhile? I'd also like to apply it to all the
> regex search/replacement processors, RouteOnContent for instance. I have
> a working POC in my environment. I just need to clean things up, create a
> ticket, and get my contribution access approved.
>
> On Wed., Dec. 16, 2020, 2:06 p.m. Eric Secules, <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I was wondering if there could be an improvement to ExtractText so that
>> the entire content of the flowfile is scanned for matches in chunks of
>> MAX_BUFFER_SIZE that overlap by MAX_CAPTURE_GROUP_LENGTH. That way we can
>> do pattern extraction over arbitrarily large files while keeping memory
>> consumption limited.
>>
>> Consider the use case where I am looking to extract a small pattern of
>> maybe 100 bytes from files that could be 1MB or 500MB. Looking at the
>> ExtractText source code, it always allocates a byte array of the maximum
>> size, so it probably wouldn't be appropriate to set that parameter too
>> high. It's essential to have the chunks overlap by the maximum length of
>> the capture group because the match may straddle two chunks. For the same
>> reason it's not advisable to split the flowfile into chunks of
>> MAX_BUFFER_SIZE using existing processors.
>>
>> Thanks,
>> Eric
>>
>
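
The overlapping-chunk scan proposed in the quoted message can be sketched
roughly as follows. This is an illustration only, written in Python rather
than NiFi's Java, and it is not the POC mentioned in the thread. The function
name scan_in_chunks and its parameter names are hypothetical; the two size
parameters mirror MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH. It assumes
that no match is longer than max_capture_length, that stream.read(n) returns
n bytes unless the end of the stream is reached, and that the pattern does
not depend on anchors or lookaround reaching across chunk boundaries.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple


def scan_in_chunks(
    stream: BinaryIO,
    pattern: re.Pattern,
    max_buffer_size: int,
    max_capture_length: int,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) for each match in the stream.

    Hypothetical helper: scans windows of at most max_buffer_size bytes,
    where consecutive windows overlap by max_capture_length bytes.
    """
    if not 0 <= max_capture_length < max_buffer_size:
        raise ValueError("overlap must be smaller than the buffer size")

    step = max_buffer_size - max_capture_length
    buf = stream.read(max_buffer_size)
    base = 0          # absolute offset of buf[0] within the stream
    next_allowed = 0  # absolute offset where the next match may start

    while buf:
        # Read ahead so we know whether the current window is the last one.
        lookahead = stream.read(step)
        is_last = not lookahead

        # Matches starting in the trailing overlap are deferred to the next
        # window, where they are guaranteed to be visible in full.
        cutoff = len(buf) - max_capture_length

        # Skip bytes already consumed by a match from the previous window.
        pos = max(0, next_allowed - base)
        for m in pattern.finditer(buf, pos):
            if not is_last and m.start() >= cutoff:
                break
            yield base + m.start(), m.group(0)
            next_allowed = base + m.end()

        if is_last:
            break

        # Slide the window: keep only the overlap, then append new bytes.
        base += cutoff
        buf = buf[cutoff:] + lookahead
```

Deferring matches that start inside the overlap means each match is reported
once, by the window that contains it completely. Because of the read-ahead,
peak memory in this sketch is slightly above max_buffer_size (one window plus
one step of new bytes), not exactly max_buffer_size.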
