endzeit created NIFI-12595:
------------------------------

             Summary: Introduce an "Entity Tracking Strategy" to ListedEntityTracker
                 Key: NIFI-12595
                 URL: https://issues.apache.org/jira/browse/NIFI-12595
             Project: Apache NiFi
          Issue Type: New Feature
    Affects Versions: 1.24.0
            Reporter: endzeit


h1. Situation

The existing {{ListX}} processors support different "Listing Strategies". One 
commonly used "Listing Strategy" is "Tracking Entities", whereby key 
information about all recently listed entities (e.g. files) is remembered.
On every listing, an entity whose information has already been remembered is 
not listed again, unless it has been modified since.

This has several benefits over the other available "Listing Strategies". For 
example, unlike with "No Tracking", the same entity is not listed repeatedly. 
Unlike with "Tracking Timestamps", entities with an older timestamp than ones 
previously listed can still be picked up. 
However, the strategy comes with its own problems.

h1. Problem

Because available memory and performance are always limited, entities 
cannot be tracked indefinitely.
That's why the {{ListedEntityTracker}}, which most processors use to implement 
"Tracking Entities", introduces the notion of an "Entity Tracking Time 
Window".
All remembered entities that fall outside the time window (i.e. they are older 
than the current time minus the time window) are removed from the tracking 
cache to limit memory use. Additionally, not yet listed entities that fall 
outside the time window are excluded from listing, because they would be 
removed from the cache immediately on the next run and would then be listed 
over and over. 
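
The time-window behavior described above can be illustrated with a minimal 
sketch. It is a simplified model, not the actual {{ListedEntityTracker}} 
implementation: the class {{TimeWindowTrackerSketch}}, its {{list}} method and 
the in-memory map are hypothetical names chosen for illustration. Only the 
cut-off name {{minTimestampToList}} is taken from this issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of time-window based entity tracking. */
class TimeWindowTrackerSketch {
    // entity identifier -> timestamp (epoch millis) remembered at the last listing
    private final Map<String, Long> tracked = new HashMap<>();
    private final long windowMillis;

    TimeWindowTrackerSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns the identifiers that would be listed for the given source snapshot. */
    List<String> list(Map<String, Long> source, long now) {
        final long minTimestampToList = now - windowMillis;

        // Remembered entities that fell out of the window are dropped to limit memory use.
        tracked.values().removeIf(timestamp -> timestamp < minTimestampToList);

        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entity : source.entrySet()) {
            long timestamp = entity.getValue();
            if (timestamp < minTimestampToList) {
                continue; // outside the window: silently skipped, never listed
            }
            Long known = tracked.get(entity.getKey());
            if (known == null || known < timestamp) {
                listed.add(entity.getKey()); // new or modified entity
                tracked.put(entity.getKey(), timestamp);
            }
        }
        return listed;
    }
}
```

With a window of 1000 ms and a current time of 10000, an entity with 
timestamp 100 is never returned by {{list}} in this model, while an entity 
with timestamp 9500 is returned exactly once until it is modified.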

However, this results in entities "older" than the specified "Entity Tracking 
Time Window" not being picked up. For example, suppose entities are listed 
from a remote server and this server is unavailable for some time. Once the 
server is available again, the listing continues. However, all entities / 
files that were created before the start of the defined time window are 
silently ignored.

As of now, this can be solved by manual intervention, i.e. by restarting the 
{{ListX}} processor. The "Entity Tracking Time Window" is ignored on the 
initial listing if "Entity Tracking Initial Listing Target" is set to 
"All Available" (default).

However, this requires the NiFi user to be aware that old entities are 
lingering on the connected remote source. Additionally, manual intervention 
may be undesirable or impractical when many sources are connected.

Alternatively, the "Entity Tracking Time Window" can be increased to account 
for longer time frames. However, this only mitigates the problem and does not 
solve it. There is also a limit to this approach, as a larger window increases 
the memory needed.

h1. Proposal

This issue proposes introducing the notion of an "Entity Tracking Strategy", 
whereby the current behavior can be understood as "Tracking Time Window".

A new strategy, "Tracking Recently Listed", is added. Unlike the existing 
"Tracking Time Window" strategy, it would not impose any prerequisites on the 
entities regarding their timestamp (see {{minTimestampToList}}). Instead, all 
entities would be considered. 
However, this strategy needs a way to limit / clean the entity cache as well. 
Instead of removing entries that leave the time window, the strategy should 
remember only the last N listed entities. That is, every time an entity is 
listed, it is moved to the front of an "ordered list". If the entity has been 
listed before, its existing entry in the "list" is moved to the front as well. 
After every listing, only the first N entries (at most) are kept. All other, 
less recently listed entities are removed from the cache.
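
One possible way to realize the "ordered list" is an access-ordered 
{{java.util.LinkedHashMap}}, sketched below. This is only an illustration of 
the proposed behavior, not a suggested patch: the class 
{{RecentlyListedTrackerSketch}} and its members are hypothetical, and a real 
implementation would have to work against the tracker's actual cache.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a "Tracking Recently Listed" strategy. */
class RecentlyListedTrackerSketch {
    // Access-ordered map: iteration runs from least to most recently touched entry.
    private final LinkedHashMap<String, Long> tracked = new LinkedHashMap<>(16, 0.75f, true);
    private final int maxEntries;

    RecentlyListedTrackerSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Returns the identifiers that would be listed; no minimum timestamp is applied. */
    List<String> list(Map<String, Long> source) {
        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> entity : source.entrySet()) {
            Long known = tracked.get(entity.getKey());
            if (known == null || known < entity.getValue()) {
                listed.add(entity.getKey()); // new or modified entity
            }
            // put() marks the entry as most recently listed, even if it was known already.
            tracked.put(entity.getKey(), entity.getValue());
        }
        // After the listing, keep only the N most recently listed entries.
        Iterator<String> leastRecentFirst = tracked.keySet().iterator();
        while (tracked.size() > maxEntries && leastRecentFirst.hasNext()) {
            leastRecentFirst.next();
            leastRecentFirst.remove();
        }
        return listed;
    }

    int trackedCount() {
        return tracked.size();
    }
}
```

In this sketch an entity is listed regardless of how old its timestamp is, 
and the cache never holds more than {{maxEntries}} entries once a listing has 
finished.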

While this strategy solves the problem of listing "old" entities, it comes 
with its own downsides. The NiFi user has to configure a sensible value for 
the maximum number N of cache entries. Failing to do so can result in an 
entity being listed more than once (similar to "No Tracking") when the source 
provides more than N entities at once (e.g. due to a load peak) and the 
entities are not removed from the source before the next listing. Thus, this 
approach works best for flows where the entity is removed from the source 
shortly after the listing. 
This behavior should be mentioned in the documentation of the property.
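
The downside can be made concrete with a small simulation. It assumes a 
source that holds three entities which are never removed, and it models the 
cache as a plain deque of identifiers with the most recently listed entry 
first. The class and method names are hypothetical and exist only for this 
example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical simulation of a cache that is too small for the source. */
class CacheOverflowDemo {
    /** Runs one listing; the cache holds identifiers, most recently listed first. */
    static List<String> listOnce(List<String> source, Deque<String> cache, int maxEntries) {
        List<String> emitted = new ArrayList<>();
        for (String id : source) {
            boolean known = cache.remove(id); // true if the entity was remembered
            if (!known) {
                emitted.add(id);
            }
            cache.addFirst(id); // move (or add) the entity to the front
        }
        while (cache.size() > maxEntries) {
            cache.removeLast(); // forget the least recently listed entities
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> source = List.of("a", "b", "c"); // entities stay on the source

        Deque<String> tooSmall = new ArrayDeque<>();
        System.out.println(listOnce(source, tooSmall, 2)); // prints [a, b, c]
        System.out.println(listOnce(source, tooSmall, 2)); // prints [a]: "a" is listed again

        Deque<String> largeEnough = new ArrayDeque<>();
        System.out.println(listOnce(source, largeEnough, 3)); // prints [a, b, c]
        System.out.println(listOnce(source, largeEnough, 3)); // prints []
    }
}
```

With N = 2 the entity "a" is forgotten after every listing and therefore 
emitted on every run, whereas N = 3 is sufficient for this source.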



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
