[jira] [Comment Edited] (NIFI-3332) Bug in ListXXX causes matching timestamps to be ignored on later runs

2018-07-09 Thread Koji Kawamura (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537865#comment-16537865
 ] 

Koji Kawamura edited comment on NIFI-3332 at 7/10/18 5:00 AM:
--

[~joewitt] I agree with your suggested approach, and [~doaks80]'s script 
approach makes sense to address missing files caused by the timestamp 
assumption. I am going to start writing a new processor similar to Daniel's 
script, using a K/V store.
 I've come up with the following spec; do you agree with it? If it looks 
reasonable, I will create another JIRA to track the addition of a new set of 
WatchEntries processors. Thanks!
 * Add a new abstract processor, AbstractWatchEntries, similar to 
AbstractListProcessor but using a different approach
 * Target entities have: name (path), size, and last-modified timestamp
 * Implementation processors have the following properties:
 ** 'Watch Time Window' limits the maximum time period for which already-listed 
entries are retained. E.g., if set to '30 min', the processor keeps entities 
listed within the last 30 minutes.
 ** 'Minimum File Age' defers listing entities that are potentially still 
being written
 * Any never-listed entity whose last-modified timestamp is older than the 
configured 'Watch Time Window' will not be listed. If users need to pick up 
such items, they have to make the 'Watch Time Window' longer, which also 
increases the amount of data the processor has to persist in the K/V store: an 
efficiency vs. reliability trade-off.
 * Already-listed entities are persisted into one of the supported K/V stores 
through the DistributedMapCacheClient service. Users can choose which KVS to 
use from HBase, Redis, Couchbase, or File (DistributedMapCacheServer with a 
persistence file).
 * The reason to use a KVS instead of ManagedState is to avoid hammering 
ZooKeeper by frequently updating a ZK node with a large amount of data; the 
number of already-listed entries can be huge depending on the use case. Also, 
we can compress entities with DistributedMapCacheClient since it supports 
putting byte arrays, while ManagedState only supports a Map of Strings.
 * On each onTrigger:
 ** The processor performs a listing. Listed entries meeting any of the 
following conditions are written to the 'success' output FlowFile:
 *** Not present in the already-listed entities
 *** Having a newer last-modified timestamp
 *** Having a different size
 ** Already-listed entries that are older than the 'Watch Time Window' are 
discarded.
 * Initial supported targets are the local file system, FTP, and SFTP
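To make the proposed behavior concrete, here is a minimal in-memory sketch of the onTrigger logic. All names (WatchEntriesSketch, on_trigger, etc.) are hypothetical; a real processor would persist already_listed through DistributedMapCacheClient rather than keep it in a local dict:

```python
class WatchEntriesSketch:
    """Hypothetical sketch of the proposed WatchEntries listing logic.

    A real implementation would persist `already_listed` in a K/V store
    via DistributedMapCacheClient; it is kept in memory here only for
    illustration.
    """

    def __init__(self, watch_time_window, minimum_file_age=0):
        self.window = watch_time_window    # seconds to retain listed entries
        self.min_age = minimum_file_age    # defer possibly-in-flight files
        self.already_listed = {}           # name -> (last_modified, size)

    def on_trigger(self, entries, now):
        """entries: iterable of (name, last_modified, size); returns listed names."""
        listed = []
        for name, mtime, size in entries:
            if now - mtime < self.min_age:
                continue  # entity is potentially still being written
            if now - mtime > self.window:
                continue  # older than the watch window; never listed
            prev = self.already_listed.get(name)
            # list if unseen, or if timestamp/size changed since last listing
            if prev is None or mtime > prev[0] or size != prev[1]:
                listed.append(name)
                self.already_listed[name] = (mtime, size)
        # discard tracked entries that fell out of the window
        self.already_listed = {
            n: v for n, v in self.already_listed.items()
            if now - v[0] <= self.window
        }
        return listed
```

Note how a re-run against unchanged files lists nothing, while a size change or newer timestamp re-triggers listing, which is the property the timestamp-only logic lacks.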



[jira] [Comment Edited] (NIFI-3332) Bug in ListXXX causes matching timestamps to be ignored on later runs

2018-06-25 Thread Mike Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522790#comment-16522790
 ] 

Mike Liu edited comment on NIFI-3332 at 6/25/18 8:49 PM:
-

Hey [~doaks80]

I'm encountering almost the same issue (same NiFi version, similar symptoms, 
except it seems that all future files also fail to get picked up). Did you 
figure out a potential workaround?



> Bug in ListXXX causes matching timestamps to be ignored on later runs
> -
>
> Key: NIFI-3332
> URL: https://issues.apache.org/jira/browse/NIFI-3332
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
>Affects Versions: 0.7.1, 1.1.1
>Reporter: Joe Skora
>Assignee: Koji Kawamura
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: Test-showing-ListFile-timestamp-bug.log, 
> Test-showing-ListFile-timestamp-bug.patch, listfiles.png
>
>
> The new state implementation for the ListXXX processors based on 
> AbstractListProcessor creates a race condition when processor runs occur 
> while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a 
> given timestamp.  Without the record of files processed, the remainder of the 
> batch is ignored on the next processor run since their timestamp is not 
> greater than the one timestamp stored in processor state.  With the file 
> tracking it was possible to process files that matched the timestamp exactly 
> and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx 
> is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp 
> in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating 2nd half Tx 
> timestamp batch.
> I've attached a patch [1] for TestListFile.java that adds an instrumented 
> unit test demonstrating the problem, and a log [2] of the output from one 
> such run.  The test writes 3 files in each of two batches, with a processor 
> run after each batch.  Batch 2 writes files with timestamps older than, 
> equal to, and newer than the timestamp stored when batch 1 was processed, 
> but only the newer file is picked up.  The older file is correctly ignored, 
> but the file with the matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log
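The quoted timeline can be reduced to a few lines of Python showing why timestamp-only state drops the second half of the batch. The helper and file names below are purely illustrative, not the actual processor code:

```python
def list_newer_than(files, state_timestamp):
    """Timestamp-only listing as described in the bug report: emit only
    files whose timestamp is strictly newer than the stored state."""
    return [name for name, mtime in files if mtime > state_timestamp]

tx = 100                              # the shared batch timestamp Tx
state = float("-inf")                 # no timestamp stored yet

# T2: ListFile runs against the first half of the Tx batch
first_half = [("file-1", tx), ("file-2", tx)]
run1 = list_newer_than(first_half, state)      # both files are listed
state = max(mtime for _, mtime in first_half)  # state becomes Tx

# T4: the second half arrives, also at Tx, and is silently dropped
second_half = [("file-3", tx), ("file-4", tx)]
run2 = list_newer_than(first_half + second_half, state)  # empty: Tx is not > Tx
```

Because the state holds only the single timestamp Tx, every later file that shares Tx fails the strictly-greater comparison.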



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NIFI-3332) Bug in ListXXX causes matching timestamps to be ignored on later runs

2017-02-21 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877187#comment-15877187
 ] 

Koji Kawamura edited comment on NIFI-3332 at 2/22/17 7:03 AM:
--

[~jskora] I totally agree with adding more documentation to clearly describe 
the cases in which ListXXX can potentially miss files. I also agree that the 
LAG algorithm may not work, as the test proves.

However, I think even the previous logic (keeping the filenames that were 
listed in the last iteration and whose timestamp was the latest at that 
iteration) is not a perfect solution, as the test also proves:

h3. Table 1: Test result if we add previous logic back
|| Processor runs at || Files in a dir || Result: Managed State || Result: 
Outgoing FlowFile ||
| t3 | batch1-age3.txt, t3 \\ batch1-age4.txt, t4 \\ batch1-age5.txt, t5 | 
listing.timestamp = t3, \\ last.ids.listed = batch1-age3.txt | batch1-age3.txt 
\\batch1-age4.txt \\batch1-age5.txt |
| t2 | batch2-age2.txt, t2 + \\ batch1-age3.txt, t3 \\ batch2-age3.txt, t3 + \\ 
batch1-age4.txt, t4 \\ batch2-age4.txt, t4 + \\ batch1-age5.txt, t5 | 
listing.timestamp = t2, \\ last.ids.listed = batch2-age2.txt | batch2-age2.txt 
\\batch2-age3.txt |

A higher t number indicates an older timestamp; e.g., if t3 = 10:23:34, then 
t4 might be 10:23:12, t5 10:21:43, and so on.

As illustrated in the table above, at the 2nd run batch2-age3.txt is listed 
because, even though it has the same timestamp as the last listing.timestamp, 
last.ids.listed doesn't contain it.

Similarly, we would expect batch2-age4.txt to be listed at the 2nd run, but it 
won't be, because its timestamp was not the latest one at the previous 
iteration. If it is acceptable to skip batch2-age4.txt, then I think skipping 
batch2-age3.txt is also fine, isn't it? If we need to support both, then we 
would need to store an entire snapshot of the filenames at the previous run...
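A small sketch of the previous logic (hypothetical function name, numeric timestamps where a larger number is newer) shows exactly this: batch2-age3.txt is picked up via the id list, while batch2-age4.txt is skipped:

```python
def list_with_id_tracking(files, state_ts, last_ids):
    """Sketch of the pre-change logic: list files newer than state_ts, plus
    files exactly at state_ts that are not in last_ids. Returns the listing
    and the new state (latest timestamp plus the ids seen at it)."""
    listed = [n for n, ts in files
              if ts > state_ts or (ts == state_ts and n not in last_ids)]
    if not listed:
        return [], state_ts, last_ids
    new_ts = max(ts for n, ts in files if n in listed)
    new_ids = {n for n, ts in files if n in listed and ts == new_ts}
    return listed, new_ts, new_ids

# Run at t3 (using t5=5, t4=6, t3=7, t2=8, so a larger number is newer)
batch1 = [("batch1-age3.txt", 7), ("batch1-age4.txt", 6),
          ("batch1-age5.txt", 5)]
run1, ts1, ids1 = list_with_id_tracking(batch1, float("-inf"), set())
# -> all three listed; state is t3 with ids {batch1-age3.txt}

# Run at t2: batch2 arrives; batch2-age3.txt shares timestamp t3
all_files = batch1 + [("batch2-age2.txt", 8), ("batch2-age3.txt", 7),
                      ("batch2-age4.txt", 6)]
run2, ts2, ids2 = list_with_id_tracking(all_files, ts1, ids1)
# -> batch2-age2.txt and batch2-age3.txt listed; batch2-age4.txt is skipped
#    because its t4 timestamp is older than the stored t3 state
```

This matches Table 1: the id tracking only protects files at exactly the stored timestamp, not batch2-age4.txt.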

It may be impossible to make this work perfectly using just two timestamps, 
so I'd like to propose a completely different, optional approach.

h3. Add 'Ready-to-List filename' optional property to ListXXX processor

It's something like the '_SUCCESS' file in the Hadoop MapReduce world, 
indicating that a batch of processing has been done.
Let's see how it works with the previous example:

h3. Table 2: 'Ready-to-List filename' = _SUCCESS
|| Processor runs at || Files in a dir || Result: Managed State || Result: 
Outgoing FlowFile ||
| t3 | batch1-age3.txt, t3 \\ batch1-age4.txt, t4 \\ batch1-age5.txt, t5 | 
listing.timestamp = na | (none) |
| t2 | _SUCCESS, t2 + \\ batch2-age2.txt, t2 + \\ batch1-age3.txt, t3 \\ 
batch2-age3.txt, t3 + \\ batch1-age4.txt, t4 \\ batch2-age4.txt, t4 + \\ 
batch1-age5.txt, t5 | listing.timestamp = t2 | batch2-age2.txt \\ 
batch1-age3.txt \\ batch2-age3.txt \\ batch1-age4.txt \\ batch2-age4.txt \\ 
batch1-age5.txt |
| t1 | _SUCCESS, t2 \\ batch3-age1.txt, t1 + \\ batch2-age2.txt, t2 \\ 
batch1-age3.txt, t3 \\ ... | listing.timestamp = t2 | (none) |
| t0 | _SUCCESS, t1 + \\ batch3-age1.txt, t1 \\ batch2-age2.txt, t2 \\ 
batch1-age3.txt, t3 \\ ... | listing.timestamp = t1 | batch3-age1.txt |

- t3: Nothing listed, because there's no _SUCCESS file.
- t2: Found the _SUCCESS file and no 'listing.timestamp' is recorded yet, so 
list everything.
- t1: Nothing listed, because the _SUCCESS file's timestamp is not newer than 
listing.timestamp.
- t0: Found the _SUCCESS file, and it's newer than the stored 
'listing.timestamp', so list files whose timestamps are newer than t2.

This way, the List behavior can be controlled by the program that stores files 
into the directory, which should know best when they are ready to be listed. 
Also, it doesn't add any state to be managed.
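A minimal sketch of the proposed gating, assuming a simple map of filename to timestamp (a larger number is newer; all names illustrative):

```python
def ready_to_list(files, listing_timestamp, marker="_SUCCESS"):
    """Sketch of the proposed 'Ready-to-List filename' gating.
    files: dict of filename -> timestamp (larger number = newer here).
    Returns (listed filenames, new listing.timestamp)."""
    marker_ts = files.get(marker)
    if marker_ts is None:
        return [], listing_timestamp   # no marker file: list nothing
    if listing_timestamp is not None and marker_ts <= listing_timestamp:
        return [], listing_timestamp   # marker is not newer: list nothing
    # marker is newer than the stored state: list everything newer than state
    listed = [n for n, ts in files.items()
              if n != marker
              and (listing_timestamp is None or ts > listing_timestamp)]
    return listed, marker_ts
```

Walking the four rows of Table 2 through this function reproduces the outcomes above: nothing at t3, everything at t2, nothing at t1, and only batch3-age1.txt at t0.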

If we add more documentation along with this 'Ready-to-List filename' 
property, I think it can be more flexible and reliable.
What do you think?



[jira] [Comment Edited] (NIFI-3332) Bug in ListXXX causes matching timestamps to be ignored on later runs

2017-02-21 Thread Joe Skora (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877439#comment-15877439
 ] 

Joe Skora edited comment on NIFI-3332 at 2/22/17 4:15 AM:
--

[~ijokarumawak] Admittedly, discussing time orientation and time references 
for these tests can get confusing. I realize now that t3 is older than t2 and 
so on; I had mixed it up earlier today.

Table 1 looks mostly right to me.  Since it is based on the old logic, the t3 
run should output the batch1-age3.txt, batch1-age4.txt, and batch1-age5.txt 
files and store a state timestamp of t3 with a processed-files list containing 
just batch1-age3.txt, since the batch1-age4.txt and batch1-age5.txt files are 
older than the state timestamp.  The subsequent t2 run looks correct to me, 
outputting the batch2-age3.txt and batch2-age2.txt files and storing state 
with the t2 timestamp and the batch2-age2.txt file.  So the Table 1 output 
information is correct except for the state file list on the t3 run, which 
contains files not matching the state timestamp.

I'm not sure I understand your "_SUCCESS" algorithm at this point, so I can't 
comment on Table 2 until I get a chance to work through it tomorrow afternoon.


