[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409944#comment-16409944 ]

Sivaprasanna Sethuraman commented on NIFI-2853:
---

It's a bug. I created a ticket, NIFI-5000, and also raised a PR.

> Improve ListHDFS state tracking
> ---
>
>                 Key: NIFI-2853
>                 URL: https://issues.apache.org/jira/browse/NIFI-2853
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Bryan Bende
>            Priority: Minor
>
> Currently ListHDFS tracks two properties in state management,
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the
> directory property now supports expression language which means the directory
> being listed could dynamically change on any execution of the processor.
> The processor should be changed to store state specific to the directory that
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be
> run on the primary node only, otherwise each node will be overwriting each other's
> state and producing unexpected results. With the above improvement, if the
> directory evaluated to a unique path for each node, it would store the state
> of each of those paths separately.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398926#comment-16398926 ]

Sivaprasanna Sethuraman commented on NIFI-2853:
---

[~bende] I did see that code when I was going through this a few days ago, but when I replicate the scenario I described earlier, the problem still occurs. I thought it might have something to do with the isConfigurationRestored() check; I also didn't quite understand what isConfigurationRestored() does or the logic behind it.

Could you please try to replicate the scenario on your end to confirm?
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398906#comment-16398906 ]

Bryan Bende commented on NIFI-2853:
---

[~sivaprasanna] I'm not sure this scenario is accurate...

Any time the directory or filter is changed in the processor, the state tracking resets:

[https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/java/org/apache/nifi/processors/hadoop/ListHDFS.java#L204-L206]

So in your example, when you changed from "/tmp/sub-dir" to "/tmp" it would reset and pick up everything again.

I don't think there is anything to change about the state tracking. We can either close the JIRA, or apply the default scheduling strategy that Pierre suggested.
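Editor's note: the reset behaviour described in the comment above can be sketched roughly as follows. This is a simplified model written for illustration, not the code in ListHDFS.java; the class `ListingState` and all of its method names are hypothetical, and the real processor also resets on filter changes and tracks more than one timestamp.

```java
import java.util.Objects;

/** Hypothetical stand-in for the processor's listing state; not NiFi code. */
final class ListingState {
    private String directory;
    private long latestListedTimestamp = -1L; // -1 means "nothing listed yet"

    /** Called when the directory property is modified; a real change discards the saved timestamp. */
    void onDirectoryChanged(String newDirectory) {
        if (!Objects.equals(directory, newDirectory)) {
            directory = newDirectory;
            latestListedTimestamp = -1L;
        }
    }

    /** A file is listed only if it is newer than everything listed so far. */
    boolean shouldList(long modificationTime) {
        return modificationTime > latestListedTimestamp;
    }

    /** Remembers the newest modification time that has been listed. */
    void recordListed(long modificationTime) {
        latestListedTimestamp = Math.max(latestListedTimestamp, modificationTime);
    }
}
```

Under this model, switching from "/tmp/sub-dir" to "/tmp" clears the stored timestamp, so older files under "/tmp" are eligible again on the next run.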
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394497#comment-16394497 ]

Sivaprasanna Sethuraman commented on NIFI-2853:
---

Although not a critical one, I believe this is an important feature that is needed. I also think it is better to append not just the root-level directory name to "listing.timestamp" and "emitted.timestamp" but also the subdirectories, like "listing.timestamp.dir1.subdir2" and "listing.timestamp.dir1.subdir3.subdir3_1", to avoid edge cases.

The reason is that, if we don't do that, files might not get picked up in some scenarios. Example:
# Create a directory "/tmp/sub-dir1"
# Create a file "file1.txt" under "/tmp/sub-dir1"
# Create a couple of files under "/tmp"
# Create another file "file2.txt" under "/tmp/sub-dir1"

Now set the ListHDFS "Directory" property to /tmp/sub-dir1 and run the flow. It will set the timestamp to that of the latest file, which is "/tmp/sub-dir1/file2.txt". Now change the directory of ListHDFS to "/tmp". It won't pull in the files that were created in step 3, because those files' modification times are earlier than the timestamp stored as part of the processor's state. This will not happen with the proposed approach.

Thoughts?
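Editor's note: a minimal Java sketch of the per-directory key idea proposed in this comment, under stated assumptions. `PerDirectoryState`, `listingKey`, and the in-memory map are hypothetical names used only for illustration; this is not NiFi's state-management API, and turning path segments into dot-separated suffixes is just one possible encoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-directory listing state; not NiFi code. */
final class PerDirectoryState {
    static final String LISTING_PREFIX = "listing.timestamp";

    // Stand-in for the processor's stored state: one timestamp per directory key.
    private final Map<String, Long> state = new HashMap<>();

    /** Maps "/tmp/sub-dir1" to "listing.timestamp.tmp.sub-dir1"; "/" maps to the bare prefix. */
    static String listingKey(String directory) {
        StringBuilder key = new StringBuilder(LISTING_PREFIX);
        for (String segment : directory.split("/")) {
            if (!segment.isEmpty()) {
                key.append('.').append(segment);
            }
        }
        return key.toString();
    }

    /** Returns the newest listed timestamp for this exact directory, or -1 if none is stored. */
    long lastListed(String directory) {
        return state.getOrDefault(listingKey(directory), -1L);
    }

    /** Stores the newest modification time seen for this exact directory. */
    void recordListed(String directory, long modificationTime) {
        state.merge(listingKey(directory), modificationTime, Long::max);
    }
}
```

With separate keys, a timestamp recorded for "/tmp/sub-dir1" does not affect a later listing of "/tmp". Two caveats of this particular encoding: directory names that themselves contain dots can collide, and the number of stored keys grows with the number of distinct directories listed.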
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795308#comment-15795308 ]

Pierre Villard commented on NIFI-2853:
---

With NIFI-1526, it is now possible to define a default scheduling strategy in the processors. I think it would make sense to add:

{{@DefaultSchedule(strategy = SchedulingStrategy.PRIMARY_NODE_ONLY)}}

in the concerned processors. Thoughts?
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583302#comment-15583302 ]

Bryan Bende commented on NIFI-2853:
---

[~joewitt] Sticking with running ListHDFS only on the primary node makes sense to me.

Do you think we should consider an annotation like @PrimaryNodeOnly so that the framework can prevent certain processors from being scheduled on all nodes? Or maybe that is overkill and we should just ensure proper documentation?
[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking
[ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570581#comment-15570581 ]

Joseph Witt commented on NIFI-2853:
---

If very clearly documented, I could see how ListHDFS could be allowed to run on multiple nodes if configured to look at unique directories, but it also seems to be a potentially unnecessary complexity. The idea for ListHDFS was to do something very lightweight from a single node; the work could then be farmed out across the cluster for the heavier lifting, which would be the actual FetchHDFS calls.

By allowing multiple nodes to execute ListHDFS at once, we have to have some way to namespace the state, and while the directory being listed would do it, this also means we could end up needing to store an arbitrarily large number of directories. In the single-node case, regardless of how many directories we're pulling from, it won't matter, because a single value for the timestamp of listing and emitting is sufficient for all of them (just look for anything in any matching directory that has changed since that time).
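Editor's note: the single-node argument in this comment (one timestamp is enough no matter how many directories match) can be illustrated with a short Java sketch. `SingleTimestampListing` and `changedSince` are hypothetical names, and the map of paths to modification times stands in for whatever the file system listing returns; this is an illustration of the idea, not the processor's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical single-timestamp filter; not NiFi code. */
final class SingleTimestampListing {
    /** Returns, in sorted order, every path from any directory modified after lastListed. */
    static List<String> changedSince(Map<String, Long> modificationTimesByPath, long lastListed) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : modificationTimesByPath.entrySet()) {
            if (entry.getValue() > lastListed) {
                changed.add(entry.getKey());
            }
        }
        Collections.sort(changed); // deterministic order regardless of map implementation
        return changed;
    }
}
```

The stored state here is a single `long`, independent of how many directories the paths come from, which is the property the comment relies on.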