[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2018-03-22 Thread Sivaprasanna Sethuraman (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409944#comment-16409944
 ] 

Sivaprasanna Sethuraman commented on NIFI-2853:
---

It's a bug. Created a ticket NIFI-5000 and also raised a PR.

> Improve ListHDFS state tracking
> ---
>
> Key: NIFI-2853
> URL: https://issues.apache.org/jira/browse/NIFI-2853
> Project: Apache NiFi
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Bryan Bende
> Priority: Minor
>
> Currently ListHDFS tracks two properties in state management,
> "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the
> directory property now supports expression language, which means the
> directory being listed could dynamically change on any execution of the
> processor.
> The processor should be changed to store state specific to the directory that
> was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be
> run on primary node only, otherwise each node will be overwriting each other's
> state and producing unexpected results. With the above improvement, if the
> directory evaluated to a unique path for each node, it would store the state
> of each of those paths separately.
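
As a rough illustration of the per-directory keys proposed in the description,
here is a minimal Java sketch. The class and method names are hypothetical, and
a plain Map<String, String> stands in for the processor's state; this is not
code from ListHDFS.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that namespaces the two state keys by directory. */
final class DirectoryStateKeys {
    static final String LISTING_PREFIX = "listing.timestamp";
    static final String EMITTED_PREFIX = "emitted.timestamp";

    private DirectoryStateKeys() {
    }

    /** Builds a key such as "listing.timestamp.dir1". */
    static String listingKey(String directory) {
        return LISTING_PREFIX + "." + directory;
    }

    /** Builds a key such as "emitted.timestamp.dir1". */
    static String emittedKey(String directory) {
        return EMITTED_PREFIX + "." + directory;
    }

    /** Returns a copy of the state with both timestamps stored for one directory. */
    static Map<String, String> record(Map<String, String> state, String directory,
                                      long listed, long emitted) {
        Map<String, String> updated = new HashMap<>(state);
        updated.put(listingKey(directory), Long.toString(listed));
        updated.put(emittedKey(directory), Long.toString(emitted));
        return updated;
    }
}
```

Recording a second directory adds two more entries and leaves the first
directory's entries untouched, so the state grows with the number of
directories listed.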



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2018-03-14 Thread Sivaprasanna Sethuraman (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398926#comment-16398926
 ] 

Sivaprasanna Sethuraman commented on NIFI-2853:
---

[~bende] I actually saw that code when I was going through this a few days
ago, but when I replicate the scenario that I mentioned above, it still
happens. So I thought it had something to do with the
isConfigurationRestored() check. I also didn't quite understand what
isConfigurationRestored() does or the logic behind it. Can you please try to
replicate the above scenario on your end to confirm?



[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2018-03-14 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398906#comment-16398906
 ] 

Bryan Bende commented on NIFI-2853:
---

[~sivaprasanna] I'm not sure this scenario is accurate...

Any time the directory or filter is changed in the processor, the state 
tracking resets:

[https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/java/org/apache/nifi/processors/hadoop/ListHDFS.java#L204-L206]

So in your example when you changed from "/tmp/sub-dir" to "/tmp" it would 
reset and pick up everything again.
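
To make the reset behaviour described above concrete, here is a minimal Java
sketch. It is not the code at the linked lines (which is not reproduced here);
the class, its fields, and its methods are hypothetical, and only the idea of
clearing both timestamps when the directory or filter changes comes from the
comment.

```java
import java.util.Objects;

/** Hypothetical tracker that forgets its timestamps when the listing target changes. */
final class ListingTracker {
    private String directory;
    private String fileFilter;
    private long latestListed = -1L;
    private long latestEmitted = -1L;

    ListingTracker(String directory, String fileFilter) {
        this.directory = directory;
        this.fileFilter = fileFilter;
    }

    /** Stores the timestamps observed by the most recent listing. */
    void recordListing(long listed, long emitted) {
        latestListed = listed;
        latestEmitted = emitted;
    }

    /** Changing the directory invalidates the tracked timestamps. */
    void setDirectory(String newDirectory) {
        if (!Objects.equals(directory, newDirectory)) {
            directory = newDirectory;
            reset();
        }
    }

    /** Changing the filter invalidates the tracked timestamps as well. */
    void setFileFilter(String newFilter) {
        if (!Objects.equals(fileFilter, newFilter)) {
            fileFilter = newFilter;
            reset();
        }
    }

    private void reset() {
        latestListed = -1L;
        latestEmitted = -1L;
    }

    long latestListed() {
        return latestListed;
    }

    long latestEmitted() {
        return latestEmitted;
    }
}
```

Setting the same value again leaves the timestamps alone; only an actual
change clears them, after which the next listing starts from scratch.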

I don't think there is anything to change about the state tracking. We can 
either close the JIRA, or apply the default scheduling strategy that Pierre 
suggested.



[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2018-03-11 Thread Sivaprasanna Sethuraman (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394497#comment-16394497
 ] 

Sivaprasanna Sethuraman commented on NIFI-2853:
---

Although not critical, I believe this is an important feature. I also think it
is better to append not just the root-level directory name to
"listing.timestamp" and "emitted.timestamp" but also the subdirectories, like
"listing.timestamp.dir1.subdir2" and
"listing.timestamp.dir1.subdir3.subdir3_1", to avoid edge cases. Otherwise,
files might not get picked up in some scenarios.
Ex:
 # Create a directory "/tmp/sub-dir1"
 # Create a file "file1.txt" under "/tmp/sub-dir1"
 # Create a couple of files under "/tmp"
 # Create another file "file2.txt" under "/tmp/sub-dir1"

Now set the ListHDFS "Directory" property to /tmp/sub-dir1 and run the flow. It
will set the timestamp to that of the most recently modified file, which is
"/tmp/sub-dir1/file2.txt". Now change the directory of ListHDFS to "/tmp". It
won't pull in the files that were created in step 3, because those files'
modified times would be earlier than the timestamp stored as part of the
processor's state. That would not happen with the approach described above.
Thoughts?
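
The steps above can be sketched in Java as follows. This is only a simulation
under stated assumptions: the stored timestamp is assumed to survive the
directory change, the listing is assumed to be non-recursive, and the file
names under "/tmp" and all modification times are invented for illustration.
None of the names below come from ListHDFS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of listing with a single stored timestamp. */
final class SharedTimestampScenario {
    private SharedTimestampScenario() {
    }

    /** Returns files directly under the directory that are newer than lastTimestamp. */
    static List<String> list(Map<String, Long> files, String directory, long lastTimestamp) {
        List<String> picked = new ArrayList<>();
        String prefix = directory.endsWith("/") ? directory : directory + "/";
        for (Map.Entry<String, Long> entry : files.entrySet()) {
            String path = entry.getKey();
            boolean directChild = path.startsWith(prefix)
                    && path.indexOf('/', prefix.length()) < 0;
            if (directChild && entry.getValue() > lastTimestamp) {
                picked.add(path);
            }
        }
        return picked;
    }

    /** Files from steps 1 to 4, with made-up modification times in creation order. */
    static Map<String, Long> exampleFiles() {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("/tmp/sub-dir1/file1.txt", 100L); // step 2
        files.put("/tmp/a.txt", 200L);              // step 3 (invented name)
        files.put("/tmp/b.txt", 210L);              // step 3 (invented name)
        files.put("/tmp/sub-dir1/file2.txt", 300L); // step 4
        return files;
    }
}
```

Listing "/tmp/sub-dir1" from a fresh state returns both files and leaves 300 as
the stored timestamp. Listing "/tmp" with that same 300 returns nothing,
whereas listing it with its own unset timestamp (-1) returns the two files
from step 3.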



[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2017-01-03 Thread Pierre Villard (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795308#comment-15795308
 ] 

Pierre Villard commented on NIFI-2853:
--

With NIFI-1526, it is now possible to define a default scheduling strategy in 
the processors. I think it would make sense to add:

{{@DefaultSchedule(strategy = SchedulingStrategy.PRIMARY_NODE_ONLY)}}

in the concerned processors. Thoughts?



[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2016-10-17 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583302#comment-15583302
 ] 

Bryan Bende commented on NIFI-2853:
---

[~joewitt] Sticking with running ListHDFS only on primary node makes sense to
me.

Do you think we should consider a @PrimaryNodeOnly annotation so that the
framework can prevent certain processors from being scheduled on all nodes? Or
maybe that is overkill and we should just ensure proper documentation?
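
For discussion, a minimal Java sketch of how such an annotation could be
checked. The annotation is defined locally as a hypothetical stand-in, and the
guard class and the two example processor classes are invented; nothing here
is an existing framework API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical marker for processors that must only run on the primary node. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PrimaryNodeOnly {
}

/** Invented example of a listing-style processor carrying the marker. */
@PrimaryNodeOnly
final class ExampleListingProcessor {
}

/** Invented example of a processor without the marker. */
final class ExampleFetchProcessor {
}

/** Hypothetical framework-side check performed before scheduling. */
final class SchedulingGuard {
    private SchedulingGuard() {
    }

    /** Returns false when the class is marked as primary-node-only. */
    static boolean mayRunOnAllNodes(Class<?> processorClass) {
        return !processorClass.isAnnotationPresent(PrimaryNodeOnly.class);
    }
}
```

The runtime retention is what lets the check see the marker through
reflection; with the default class retention, isAnnotationPresent would
return false.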



[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

2016-10-12 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570581#comment-15570581
 ] 

Joseph Witt commented on NIFI-2853:
---

If very clearly documented, I could see how ListHDFS could be allowed to run on
multiple nodes if configured to look at unique directories, but it also seems
to be a potentially unnecessary complexity.

The idea for ListHDFS was to do something very lightweight from a single node;
the heavier work, which would be the actual FetchHDFS calls, could then be
farmed out across the cluster.

By allowing multiple nodes to execute ListHDFS at once, we have to have some
way to namespace the state. While the directory being listed would do it, this
also means we could end up needing to store an arbitrarily large number of
directories. In the single-node case, regardless of how many directories we're
pulling from, it won't matter, because a single value for the timestamp of
listing and emitting is sufficient for all of them (just look for anything in
any matching directory that has changed since that time).
