[ https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361129#comment-16361129 ]

Andrew Burkard edited comment on NIFI-4715 at 2/12/18 8:53 PM:
---------------------------------------------------------------

An additional problem with the processor is that it can also skip files if the 
bucket is being written to while being listed. Consider the following scenario:

 

We are in the middle of listing a bucket with millions of objects. 
Simultaneously we write two objects to the bucket:

aaaaaaaaa.txt at t=1

zzzzzzzzzz.txt at t=2

 

Because files are listed in lexicographical order, the processor will eventually 
get to the 'z' file and persist state as t=2. It will never see the 'a' file. 
As a result, -we can only be certain that the processor will list all files 
with lastModified < the time at which the processor starts-. This implies that, 
to avoid duplicates when re-listing from that start time, we also need to store 
all keys seen during the loop with lastModified > the processor start time. As 
an added complication, I'm not sure how we would resolve potential differences 
between S3 timestamps and the system's time to enable this.
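
To make that concrete, here is a minimal sketch of the bookkeeping, assuming we 
persist the listing start time instead of the max lastModified seen; all names 
are hypothetical, not actual NiFi code:

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real ListS3: persist the time the listing
// started, and remember keys at or after that watermark so re-listing from
// the same point does not emit duplicates.
class ListingState {
    long stateTimestamp = 0L;                     // persisted between runs
    Map<String, Long> seenKeys = new HashMap<>(); // key -> lastModified

    /** True if the object has not already been emitted on an earlier pass. */
    boolean shouldEmit(String key, long lastModified) {
        if (lastModified < stateTimestamp) {
            return false;                         // covered by a previous pass
        }
        return seenKeys.putIfAbsent(key, lastModified) == null;
    }

    /** listingStart must be captured before the pass began (the clock-skew
     *  caveat above still applies to how it is obtained). */
    void commit(long listingStart) {
        stateTimestamp = listingStart;            // not the max lastModified
        // keys below the new watermark can no longer collide; drop them
        seenKeys.values().removeIf(ts -> ts < listingStart);
    }
}
{code}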

 

Update: Due to eventual consistency, we can't even be certain that all objects 
with lastModified < t will be returned by a listing performed at t.



> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>            Priority: Major
>         Attachments: List-S3-dup-issue.xml, screenshot-1.png
>
>
> ListS3 state is implemented using a HashSet, which is not thread safe. When 
> ListS3 operates in multi-threaded mode, it sometimes tries to list the same 
> file from the S3 bucket. It seems like the HashSet data is getting corrupted.
> {code}
> currentKeys = new HashSet<>(); // needs a thread-safe implementation, e.g.:
> currentKeys = ConcurrentHashMap.newKeySet();
> {code}
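> For what it's worth, a minimal self-contained sketch of that thread-safe 
> replacement (illustrative only, not the actual processor field):
> {code}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> class CurrentKeysDemo {
>     // ConcurrentHashMap.newKeySet() returns a concurrent Set view that is
>     // safe for concurrent adds and lookups from multiple threads.
>     static final Set<String> currentKeys = ConcurrentHashMap.newKeySet();
>
>     public static void main(String[] args) {
>         System.out.println(currentKeys.add("some/key.txt")); // true
>         System.out.println(currentKeys.add("some/key.txt")); // false: atomic
>     }
> }
> {code}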
> *{color:red}+Update+{color}*:
> This is not a HashSet issue:
> Root cause is: a file gets uploaded to S3 while a ListS3 listing is in 
> progress. In onTrigger, maxTimestamp is initialized to 0L, which causes the 
> key set to be cleared by the code below. When the lastModifiedTime on an S3 
> object is the same as the currentTimestamp for a listed key, the key should 
> be skipped; but because the key set has been cleared, the same file is loaded 
> again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp, 
> not 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}
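> To see why initializing maxTimestamp to 0L re-lists the same object, here is 
> a small self-contained simulation of the logic described above (hypothetical 
> and simplified, not the actual processor source):
> {code}
> import java.util.HashSet;
> import java.util.Set;
>
> // With BUGGY = true the same object is emitted twice; with false,
> // maxTimestamp starts at currentTimestamp and the duplicate is skipped.
> class ClearKeysDemo {
>     static final boolean BUGGY = true;
>     static long currentTimestamp = 0L;          // persisted state
>     static Set<String> currentKeys = new HashSet<>();
>
>     static void onTrigger(String key, long lastModified) {
>         long maxTimestamp = BUGGY ? 0L : currentTimestamp;
>         if (lastModified > maxTimestamp) {
>             maxTimestamp = lastModified;
>             currentKeys.clear();                // forgets last run's keys
>         }
>         boolean alreadyListed = lastModified < currentTimestamp
>                 || (lastModified == currentTimestamp
>                     && currentKeys.contains(key));
>         if (!alreadyListed) {
>             System.out.println("emitting " + key);
>             currentKeys.add(key);
>         }
>         currentTimestamp = maxTimestamp;        // persist new state
>     }
>
>     public static void main(String[] args) {
>         onTrigger("file1", 1514987611000L);     // emitted (first listing)
>         onTrigger("file1", 1514987611000L);     // emitted again when BUGGY
>     }
> }
> {code}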
> Update: 01/03/2018
> There is one more flavor of the same defect.
> Suppose file1 is modified at 1514987611000 on S3 and currentTimestamp = 
> 1514987311000 in state.
> 1. The file will be picked up and the current state will be updated to 
> currentTimestamp=1514987311000 (but the OS system time is 1514987611000).
> 2. Next cycle, for file2 with lastModified 1514987611000: the keys will be 
> cleared because lastModified > maxTimestamp 
> (= currentTimestamp = 1514987311000). 
> currentTimestamp will be saved as 1514987611000.
> 3. Next cycle: currentTimestamp=1514987611000; "file1 modified at 
> 1514987611000" will be picked up again because file1 is no longer in the keys.
> I think the solution is that the persisted currentTimestamp needs to be the 
> current system timestamp.
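> A hedged sketch of that proposal (hypothetical names, not the actual fix): 
> capture the system clock before the listing starts and persist that as the 
> state timestamp, instead of the largest lastModified observed:
> {code}
> class PersistSystemTimeSketch {
>     static long currentTimestamp = 0L;          // persisted state
>
>     static void onTrigger() {
>         long listingStart = System.currentTimeMillis();
>         // ... perform the S3 listing, dedup against currentKeys ...
>         currentTimestamp = listingStart;        // not max(lastModified)
>         // caveat: S3 object timestamps and this clock may disagree,
>         // as noted in the comment above.
>     }
> }
> {code}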


