Mark,

This thread shook something loose in my brain from when the state changes
were made. Testing it out, I could easily create a case where the
two-timestamp approach was insufficient to avoid missing files. The hard
part was making a unit test for it, which I eventually succeeded at.

I filed a Jira for it, NIFI-3332
<https://issues.apache.org/jira/browse/NIFI-3332>, with the unit test. The
basic scenario is that if the processor runs while the system is writing a
batch of files that share the same timestamp, the processor will pick up
what has already been written but then ignore the remainder of the batch on
the next iteration.
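A simplified sketch of that failure mode (illustrative Java only; this is
neither the actual processor internals nor the unit test attached to the
ticket):

    // A lister that keeps only the newest file timestamp it has emitted
    // (millisecond mtime precision assumed) and lists files strictly
    // newer than that on each run.
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    class TwoTimestampLister {
        private long latestListedTimestamp = -1L;

        List<String> listNewFiles(Map<String, Long> directory) {
            final List<String> listed = new ArrayList<>();
            long newest = latestListedTimestamp;
            for (Map.Entry<String, Long> file : directory.entrySet()) {
                if (file.getValue() > latestListedTimestamp) {
                    listed.add(file.getKey());
                    newest = Math.max(newest, file.getValue());
                }
            }
            latestListedTimestamp = newest;
            return listed;
        }

        public static void main(String[] args) {
            final TwoTimestampLister lister = new TwoTimestampLister();
            final long t = 1484100000000L; // one shared modification time

            // The writer has finished 2 of 4 files in the batch when the
            // processor runs; both carry mtime == t, so both are listed.
            final Map<String, Long> dir = new LinkedHashMap<>();
            dir.put("batch-1.dat", t);
            dir.put("batch-2.dat", t);
            System.out.println(lister.listNewFiles(dir)); // [batch-1.dat, batch-2.dat]

            // The writer then finishes the batch, still at mtime == t.
            dir.put("batch-3.dat", t);
            dir.put("batch-4.dat", t);

            // On the next run, nothing is strictly newer than t, so the
            // rest of the batch is silently skipped: the NIFI-3332 case.
            System.out.println(lister.listNewFiles(dir)); // []
        }
    }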
It is an edge case, but I can definitely see it happening on a system under
load if data is transferred in from other places and then rolled into NiFi.
Can you take a look at that ticket and let me know what you think?

Thanks,
Joe

On Tue, Jan 10, 2017 at 10:09 PM, James McMahon <[email protected]> wrote:

> These have been invaluable insights Mark. Thank you very much for your
> help. -Jim
>
> On Tue, Jan 10, 2017 at 2:13 PM, Mark Payne <[email protected]> wrote:
>
>> Jim,
>>
>> Off the top of my head, I don't remember the reason for two dates,
>> specifically. I think it may have had to do with ensuring that if we run
>> at time X, we could potentially pick up a file that also has a timestamp
>> of X. Then we could potentially have one or more files come in at time X
>> as well, after the processor finishes running. If we only looked at the
>> one timestamp, we could miss those files that came in later but during
>> the same second or millisecond, or whatever precision your operating
>> system provides for file modification times. Someone else on the list
>> may have more insight into the exact meaning of the two timestamps, as I
>> didn't come up with the algorithm.
>>
>> Yes, the ListFile processor will scan through the directory each time
>> that it runs to find any new files. I would recommend that you not
>> schedule ListFile to run with the default "0 sec" run schedule but
>> instead set it to something like "1 min", or however often you can
>> afford/need to. I believe that if it is scheduled to run too frequently,
>> it will actually yield itself, which causes it to 'pause' for 1 second
>> (by default; this is configured in the Settings for the Processor as
>> well).
>>
>> The files that you mention there are simply the internals of the
>> Write-Ahead Log. When the WAL is updated, it picks a partition to write
>> the update to (the partition directories) and appends to whichever
>> journal file it is currently writing to. If we did this forever, those
>> files would grow indefinitely, and aside from running out of disk space,
>> restarting NiFi would take ages. So periodically (by default, every 2
>> minutes), the WAL is checkpointed.
>>
>> When this happens, it creates the 'snapshot' file, writes the current
>> state of the system to it, and then starts a new journal file for each
>> partition. So there's a 'snapshot' file that is a snapshot of the system
>> state, and then the journal files that indicate a series of changes to
>> apply to the snapshot to get back to the most recent state.
>>
>> You may occasionally see some other files, such as multiple journal
>> files, snapshot.part files, etc., that are temporary artifacts generated
>> in order to provide better performance and ensure reliability across
>> system crashes/restarts.
>>
>> The wali.lock file is simply there to ensure that we don't start NiFi
>> twice and have two different processes trying to write to those files at
>> the same time.
>>
>> Hope this helps!
>>
>> Thanks
>> -Mark
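In outline, the checkpoint cycle Mark describes could be sketched like this
(a minimal illustration of the idea, assuming a single partition; it is not
the actual WALI implementation):

    // Updates are appended to a journal and applied to in-memory state;
    // a checkpoint writes the full state to a snapshot and starts a
    // fresh journal; recovery is snapshot plus journal replay.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class MiniWriteAheadLog {
        private final Map<String, String> state = new HashMap<>(); // current system state
        private Map<String, String> snapshot = new HashMap<>();    // the 'snapshot' file
        private List<String[]> journal = new ArrayList<>();        // the current journal file

        // Normal operation: append the change to the journal, then apply it.
        void update(String key, String value) {
            journal.add(new String[] {key, value});
            state.put(key, value);
        }

        // Checkpoint (every 2 minutes by default): persist the full state
        // as the new snapshot and begin a new, empty journal.
        void checkpoint() {
            snapshot = new HashMap<>(state);
            journal = new ArrayList<>();
        }

        // Restart/recovery: start from the snapshot and replay the journal
        // to reconstruct the most recent state.
        Map<String, String> recover() {
            final Map<String, String> recovered = new HashMap<>(snapshot);
            for (String[] change : journal) {
                recovered.put(change[0], change[1]);
            }
            return recovered;
        }
    }

The point of the periodic checkpoint is that recovery cost is bounded by the
journal written since the last snapshot, which is why restarts stay fast even
though updates are append-only.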
>> On Jan 10, 2017, at 10:01 AM, James McMahon <[email protected]> wrote:
>>
>> Thank you very much Mark. This is very helpful. Can I ask you just a few
>> quick follow-up questions in an effort to better understand?
>>
>> How does NiFi use those two dates? It seems that the timestamp of the
>> last listing would be sufficient to permit NiFi to identify newly
>> received content. Why is it necessary to maintain the timestamp of the
>> most recent file it has sent out?
>>
>> How does NiFi quickly determine which files throughout the nested
>> directory structure were received after the last date it logged? Is it
>> scanning through listings of all the associated directories, flagging
>> for processing those files with later dates?
>>
>> I looked more closely at my ./state/local directory and subdirectories.
>> Can you offer a few words about the purpose of each of the following?
>> * the snapshot file
>> * the wali.lock file
>> * the partition[0-15] subdirectories, each of which appears to own a
>>   journal file
>> * the journal file
>> Where are the dates you referenced?
>>
>> Thank you again for your insights.
>>
>> On Tue, Jan 10, 2017 at 8:51 AM, Mark Payne <[email protected]> wrote:
>>
>>> Hi Jim,
>>>
>>> ListFile does not maintain a list of files w/ datetime stamps. Instead,
>>> it stores just two timestamps: the timestamp of when a listing was last
>>> performed, and the timestamp of the newest file that it has sent out.
>>> This is done precisely because we need it to be able to scale as the
>>> input becomes large.
>>>
>>> The location where this information is stored depends on a couple of
>>> things. ListFile has a property named "Input Directory Location." If
>>> that is set to "Remote" and the NiFi instance is clustered, then this
>>> information is stored in ZooKeeper. This allows the Processor to run on
>>> Primary Node only, and if a new node is elected Primary, it is able to
>>> pick up where the previous Primary Node left off.
>>>
>>> If the Input Directory Location is set to "Local" (or if NiFi is not
>>> clustered), then the state will be stored by the local State Manager,
>>> which is backed by a write-ahead log. By default it is written to
>>> ./state/local, but this can be configured in conf/state-management.xml.
>>> So if you want to be really sure that you don't lose the information,
>>> you could potentially change the location to some place that has a RAID
>>> configuration for redundancy.
>>>
>>> Thanks
>>> -Mark
>>>
>>> > On Jan 10, 2017, at 8:38 AM, James McMahon <[email protected]>
>>> > wrote:
>>> >
>>> > I am using ListFile followed by FetchFile to recurse and detect new
>>> > files that show up in a large nested directory structure that grows
>>> > over time. I need to better understand how this approach scales. What
>>> > are the practical and performance limitations of using this tandem of
>>> > processors for feeding new files to NiFi? If anyone has used this
>>> > approach in a large-scale data environment to manage new content to
>>> > NiFi, I would welcome your thoughts.
>>> >
>>> > Where does ListFile maintain its list of files with datetime stamps?
>>> > Does this get persisted as a hash map in memory? Is it also persisted
>>> > into one of the NiFi repositories as a backup? My concern is avoiding
>>> > having to reprocess the entire directory structure should that list
>>> > ever get lost or destroyed.
>>> >
>>> > Thank you in advance once again for your assistance. -Jim
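For reference, the state Mark describes above is read and written through
NiFi's StateManager API. Below is a hedged sketch of how the two listing
timestamps might be persisted; the key names are made up for illustration
and are not necessarily the keys ListFile actually uses:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.nifi.components.state.Scope;
    import org.apache.nifi.components.state.StateManager;
    import org.apache.nifi.components.state.StateMap;

    public class ListingStateExample {

        // Scope.CLUSTER ends up in ZooKeeper when clustered; Scope.LOCAL
        // goes to the write-ahead log under ./state/local (configurable
        // in conf/state-management.xml).
        static void updateListingState(StateManager stateManager, Scope scope,
                                       long lastListingTime, long newestFileTime)
                throws IOException {
            final StateMap previous = stateManager.getState(scope);

            final Map<String, String> newState = new HashMap<>(previous.toMap());
            newState.put("last.listing.timestamp", String.valueOf(lastListingTime));
            newState.put("newest.file.timestamp", String.valueOf(newestFileTime));

            stateManager.setState(newState, scope);
        }
    }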
