endzeit created NIFI-12595:
------------------------------

             Summary: Introduce an "Entity Tracking Strategy" to ListedEntityTracker
                 Key: NIFI-12595
                 URL: https://issues.apache.org/jira/browse/NIFI-12595
             Project: Apache NiFi
          Issue Type: New Feature
    Affects Versions: 1.24.0
            Reporter: endzeit

h1. Situation

The existing {{ListX}} processors support different "Listing Strategies". One commonly used "Listing Strategy" is "Tracking Entities", which remembers key information about all recently listed entities, e.g. files. On every listing, an entity whose information has already been remembered is not listed again (unless it was modified).

This has several benefits over the other available "Listing Strategies". For example, unlike with "No Tracking", the same entity is not listed repeatedly. Unlike with "Tracking Timestamps", entities with an older timestamp than ones previously listed can still be picked up.

However, the strategy comes with its own problems.

h1. Problem

Because available memory and performance are always limited, entities cannot be tracked indefinitely. That is why the {{ListedEntityTracker}}, used by most processors to implement "Tracking Entities", introduces the notion of an "Entity Tracking Time Window". All remembered entities that fall outside the time window (they are older than the current time minus the time window) are removed from the tracking cache to limit memory use. Additionally, not-yet-listed entities that fall outside the time window are excluded from listing, as they would be removed from the cache immediately on the next run, resulting in them being listed over and over.

However, this results in entities "older" than the specified "Entity Tracking Time Window" not being picked up. For example, suppose entities are listed from a remote server and this server is unavailable for some time. Once the server is available again, the listing continues. However, all entities / files that were created before the start of the defined time window are silently ignored.

As of now, this can be solved by manual intervention: restarting the {{ListX}} processor. The "Entity Tracking Time Window" can be ignored upon initial listing when the "Entity Tracking Initial Listing Target" is set to "All Available" (default).
However, this requires the NiFi user to be aware that old entities are lingering on the connected remote source. Additionally, the need for manual intervention may be undesirable or impractical when many sources are connected.

Alternatively, the "Entity Tracking Time Window" can be increased to account for longer time frames. However, this only mitigates the situation and does not solve the problem. There is also a limit to this, as it increases the memory needed.

h1. Proposal

This issue proposes introducing the notion of an "Entity Tracking Strategy", whereby the current behavior could be understood as "Tracking Time Window". A new strategy, "Tracking Recently Listed", is added. Unlike the existing "Tracking Time Window" strategy, it would not impose any prerequisites on the entities regarding their timestamp (see {{minTimestampToList}}). Instead, all entities would be considered.

However, this strategy needs a way to limit / clean the entity cache as well. Instead of removing entries that leave the time window, the strategy should remember only the last N listed entities. That is, every time an entity is listed, it is moved to the front of an "ordered list". If the entity has been listed before, its existing entry in the "list" is moved to the front as well. After every listing, only the first N entries (at most) are kept. All other, less recently listed entities are removed from the cache.

While this strategy solves the problem of listing "old" entities, it comes with its own downsides. The NiFi user has to configure a sensible value for N, the maximum number of cache entries. Failing to do so can result in an entity being listed more than once (similar to "No Tracking") when the source provides more than N entities at once (e.g. due to a load peak) and the entities are not removed from the source before the next listing. Thus, this approach works best for flows where the entity is removed from the source shortly after the listing.
This behavior should be mentioned in the documentation of the property.
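
The "remember only the last N listed entities" idea from the proposal can be illustrated with a minimal, self-contained sketch. This is not NiFi code and not the proposed implementation: the class {{RecentlyListedTracker}}, its method {{list}}, and the use of a plain identifier-to-timestamp map are hypothetical names chosen for illustration. The sketch assumes that an entity counts as "modified" when its timestamp differs from the remembered one, and it keeps the most recently listed entries at the end of an access-ordered {{LinkedHashMap}} rather than at the front of a list, which is equivalent for the purpose of trimming.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a "Tracking Recently Listed" cache; not part of NiFi. */
class RecentlyListedTracker {

    // N: the maximum number of remembered entities (user-configured in the proposal).
    private final int maxEntries;

    // accessOrder = true: iteration runs from least to most recently accessed entry,
    // and put() on an existing key moves that entry to the most-recent end.
    private final LinkedHashMap<String, Long> tracked = new LinkedHashMap<>(16, 0.75f, true);

    RecentlyListedTracker(int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        this.maxEntries = maxEntries;
    }

    /**
     * Processes one listing of the source (identifier -> timestamp) and returns the
     * identifiers to emit: entities that are not remembered or whose timestamp changed.
     * No minimum timestamp is applied, so arbitrarily old entities are considered.
     */
    List<String> list(Map<String, Long> listing) {
        List<String> toEmit = new ArrayList<>();
        for (Map.Entry<String, Long> entity : listing.entrySet()) {
            // Every listed entity becomes "most recent", whether it is new or known.
            Long previous = tracked.put(entity.getKey(), entity.getValue());
            if (previous == null || !previous.equals(entity.getValue())) {
                toEmit.add(entity.getKey());
            }
        }
        // After the listing, drop the least recently listed entries beyond N.
        Iterator<String> leastRecentFirst = tracked.keySet().iterator();
        while (tracked.size() > maxEntries) {
            leastRecentFirst.next();
            leastRecentFirst.remove();
        }
        return toEmit;
    }

    int size() {
        return tracked.size();
    }
}
```

With N = 2 and a source that keeps offering three entities, the entity listed first is evicted at the end of every run and therefore emitted again on the next one. This is the repeated-listing downside described above, and the reason N has to exceed the number of entities the source can present at once.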