[ https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
endzeit updated NIFI-12595:
---------------------------
    Summary: Introduce an "Entity Tracking Strategy" and "Tracking Recently Listed" to ListedEntityTracker  (was: Introduce an "Entity Tracking Strategy" to ListedEntityTracker)

> Introduce an "Entity Tracking Strategy" and "Tracking Recently Listed" to ListedEntityTracker
> ---------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12595
>                 URL: https://issues.apache.org/jira/browse/NIFI-12595
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 1.24.0
>            Reporter: endzeit
>            Priority: Major
>
> h1. Situation
> The existing {{ListX}} processors support different "Listing Strategies". One
> commonly used "Listing Strategy" is "Tracking Entities", whereby crucial
> information about all recently listed entities (e.g. files) is remembered.
> On every listing, if information about an entity has already been
> remembered, the entity is not listed again (unless it was modified).
> This has several benefits over the other available "Listing Strategies". For
> example, unlike with "No Tracking", the same entity is not listed repeatedly.
> Unlike with "Tracking Timestamps", entities with an older timestamp than ones
> previously listed can still be picked up.
> However, the strategy comes with its own problems.
> h1. Problem
> Because available memory and performance are always constrained, entities
> cannot be tracked indefinitely.
> That's why the {{ListedEntityTracker}}, used by most processors to implement
> "Tracking Entities", introduces the notion of an "Entity Tracking Time
> Window".
> All remembered entities that are outside the time window (older than the
> current time minus the time window) are removed from the tracking cache to
> limit memory use.
> Additionally, not yet listed entities that are outside the time window are
> exempt from listing, as they would be removed from the "cache" immediately on
> the next run, resulting in them being listed over and over.
> However, this means that entities "older" than the specified "Entity Tracking
> Time Window" are not picked up. For example, suppose entities are listed from
> a remote server and this server is unavailable for some time. Once the
> server is available again, the listing continues. However, all entities /
> files that were created before the defined time window will be silently
> ignored.
> As of now, this can be solved by manual intervention: restarting the ListX
> processor. The "Entity Tracking Time Window" can be ignored upon initial
> listing when the "Entity Tracking Initial Listing Target" is set to
> "All Available" (default).
> However, this requires the NiFi user to be aware of lingering old entities
> on the connected remote source. Additionally, the need for manual
> intervention might be undesired / impractical when many sources are
> connected.
> Alternatively, the "Entity Tracking Time Window" can be increased to account
> for longer time frames. However, this only mitigates the problem and does
> not solve it. There is also a limit to this, as it increases the memory
> needed.
> h1. Proposal
> This issue proposes introducing the notion of an "Entity Tracking Strategy",
> whereby the current behavior could be understood as "Tracking Time Window".
> A new strategy, "Tracking Recently Listed", is added. Unlike the existing
> "Tracking Time Window" strategy, it would not impose any prerequisites on
> the entities regarding their timestamp (see {{minTimestampToList}}). Instead,
> all entities would be considered.
> However, this strategy needs a way to limit / clean the entity cache as well.
> Instead of removing entries that leave the time window, the strategy should
> remember only the last N listed entities. That is, every time an entity is
> listed, it is moved to the front of an "ordered list". If the entity has
> been listed before, its existing entry in the "list" is moved to the front.
> After every listing, only the first N entries (at most) are kept. All other,
> less recently listed entities are removed from the cache.
> While this strategy solves the problem of listing "old" entities, it comes
> with its own downsides. The NiFi user has to configure a sensible value for
> the maximum number N of cache entries. Failing to do so can result in an
> entity being listed more than once (similar to "No Tracking") when the
> source provides more than N entities at once (e.g. due to a load peak) and
> the entities are not removed from the source before the next listing. Thus
> this approach works best for flows where the entity is removed from the
> source shortly after the listing. This behavior should be mentioned in the
> documentation of the property.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
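
For illustration, the proposed "Tracking Recently Listed" cache could be sketched roughly as follows. This is a minimal sketch and not the actual {{ListedEntityTracker}} code: the class {{RecentlyListedTracker}}, its methods, and the choice to treat a changed timestamp as "modified" are all assumptions made for this example. It uses an access-ordered {{java.util.LinkedHashMap}} as the "ordered list" and trims the cache only after a complete listing, as described in the proposal.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of NiFi.
class RecentlyListedTracker {
    private final int maxEntries; // the "N" from the proposal

    // Access-ordered map: iteration runs from least to most recently touched id.
    // Value is the last seen timestamp (assumption: a changed timestamp means "modified").
    private final LinkedHashMap<String, Long> cache = new LinkedHashMap<>(16, 0.75f, true);

    RecentlyListedTracker(int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        this.maxEntries = maxEntries;
    }

    /**
     * Processes one listing. `found` maps entity id to its timestamp.
     * Returns the ids that are new or modified and should be emitted.
     */
    List<String> listing(Map<String, Long> found) {
        List<String> toEmit = new ArrayList<>();
        for (Map.Entry<String, Long> entity : found.entrySet()) {
            String id = entity.getKey();
            long timestamp = entity.getValue();
            Long known = cache.get(id); // a hit also marks the entry as recently used
            if (known == null || known.longValue() != timestamp) {
                toEmit.add(id);
            }
            cache.put(id, timestamp); // insert, or update and mark as recently used
        }
        // Trim only after the whole listing: drop least recently listed entries.
        Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
        while (cache.size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove();
        }
        return toEmit;
    }

    int trackedCount() {
        return cache.size();
    }
}
```

In this sketch, an entity with an arbitrarily old timestamp is still emitted the first time it is seen, since no minimum timestamp is applied. It also shows the downside named above: if a single listing contains more than N entities that stay on the source, the ones trimmed from the cache are emitted again on the next listing.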