Re: ListS3 question

2017-08-08 Thread Adam Lamar
Laurens,

Just to add slightly to this question:

> Will ListS3 keep state correctly here for all 3 subdirectories?

The answer is yes - ListS3 will keep state correctly for all 3
subdirectories. For example, if you set up a new ListS3 processor, give it a
bucket and prefix, and start the processor, it will initially list all the
items under the prefix, including any subdirectories. On subsequent runs,
it will only list new keys under the prefix as they are added, as expected.

Things get more complicated when you take that existing ListS3 instance and
*change* the prefix. At that point, the processor state is based on the
previous prefix, and only keys that are newer than the newest key under the
old prefix will be listed. For example, if you were using date-based keys
like this where new data is produced on a daily basis (and no new data is
produced in old directories):

/year/month/day/

and your initial prefix was /2017/08/08/, you'd only pick up new keys for
that day. But if you then changed the prefix to /2017/08/, you'd still only
get keys newer than the saved state, and nothing already present under
/2017/08/07/ or older. Again, this assumes no late-arriving data. If data
did arrive late, you'd see those keys listed, since the state is based on
the last modified timestamp.
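
To make the prefix-change case concrete, here is a minimal sketch in plain Python. It is not NiFi's code: the `list_new` helper, the keys, and the timestamps are all hypothetical, the keys are written without a leading slash, and the state is simplified to a single last modified timestamp.

```python
from datetime import datetime

# Hypothetical bucket contents: key -> last modified timestamp.
objects = {
    "2017/08/07/a.log": datetime(2017, 8, 7, 23, 0),
    "2017/08/08/b.log": datetime(2017, 8, 8, 9, 0),
    "2017/08/08/c.log": datetime(2017, 8, 8, 10, 0),
}


def list_new(objects, prefix, last_seen):
    """Return keys under prefix modified after last_seen, plus the new state."""
    matches = {
        key: ts
        for key, ts in objects.items()
        if key.startswith(prefix) and (last_seen is None or ts > last_seen)
    }
    # Keep the old timestamp when nothing new was found.
    new_last_seen = max(matches.values(), default=last_seen)
    return sorted(matches), new_last_seen


# First run with the day-level prefix lists everything under it.
keys, state = list_new(objects, "2017/08/08/", None)

# Widening the prefix keeps the old state, so 2017/08/07/a.log is skipped.
keys_after_change, state = list_new(objects, "2017/08/", state)

# A late-arriving object in an older "directory" is still picked up,
# because its last modified timestamp is newer than the saved state.
objects["2017/08/07/late.log"] = datetime(2017, 8, 8, 11, 0)
late_keys, state = list_new(objects, "2017/08/", state)
```

In this sketch `keys_after_change` comes back empty and `late_keys` contains only the late object, which matches the behavior described above.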

Hope that adds more clarity.

Adam


On Tue, Aug 8, 2017 at 11:16 AM, Joe Skora wrote:

> Yes, that's my understanding too.
>
> On Tue, Aug 8, 2017 at 1:14 PM, Laurens Vets wrote:
>
>> Thank you for this information. There's no internal notion of directories
>> in S3, only objects, so I suspect I'm good if I only set the bucket name?
>>
>> On 2017-08-08 09:55, Joe Skora wrote:
>>
>> Laurens,
>>
>> The S3 User Guide "Working with Folders" page
>> explains how S3 provides a conceptual directory hierarchy using key name
>> prefixes but that buckets really just hold a flat collection of objects.
>>
>> ListS3 queries S3 for the list of objects and then uses the object
>> timestamp, as James pointed out, to determine what's new to be processed.
>> (Though it uses the last modified timestamp, not the last read timestamp.)
>> You can populate the "Prefix" property of the processor so that S3 can
>> filter the object list (as if for a directory tree) before sending the list
>> back to NiFi to make things more efficient when dealing with subsets of the
>> bucket contents.
>>
>> Regards,
>> Joe S
>>
>>
>> On Tue, Aug 8, 2017 at 11:22 AM, Laurens Vets wrote:
>>
>>> Hi list,
>>>
>>> Does the ListS3 processor keep state of multiple directories in a bucket?
>>>
>>> For instance, suppose I have a directory "logs" with subdirectories
>>> "host1", "host2" & "host3". Each directory contains logfiles which are
>>> added daily.
>>>
>>> Will ListS3 keep state correctly here for all 3 subdirectories?
>>>
>>> Thanks in advance.
>>>
>>> Laurens
>>>
>>
>>
>


Re: ListS3 question

2017-08-08 Thread Joe Skora
Yes, that's my understanding too.

On Tue, Aug 8, 2017 at 1:14 PM, Laurens Vets wrote:

> Thank you for this information. There's no internal notion of directories
> in S3, only objects, so I suspect I'm good if I only set the bucket name?
>
> On 2017-08-08 09:55, Joe Skora wrote:
>
> Laurens,
>
> The S3 User Guide "Working with Folders" page
> explains how S3 provides a conceptual directory hierarchy using key name
> prefixes but that buckets really just hold a flat collection of objects.
>
> ListS3 queries S3 for the list of objects and then uses the object
> timestamp, as James pointed out, to determine what's new to be processed.
> (Though it uses the last modified timestamp, not the last read timestamp.)
> You can populate the "Prefix" property of the processor so that S3 can
> filter the object list (as if for a directory tree) before sending the list
> back to NiFi to make things more efficient when dealing with subsets of the
> bucket contents.
>
> Regards,
> Joe S
>
>
> On Tue, Aug 8, 2017 at 11:22 AM, Laurens Vets wrote:
>
>> Hi list,
>>
>> Does the ListS3 processor keep state of multiple directories in a bucket?
>>
>> For instance, suppose I have a directory "logs" with subdirectories
>> "host1", "host2" & "host3". Each directory contains logfiles which are
>> added daily.
>>
>> Will ListS3 keep state correctly here for all 3 subdirectories?
>>
>> Thanks in advance.
>>
>> Laurens
>>
>
>


Re: ListS3 question

2017-08-08 Thread Laurens Vets
Thank you for this information. There's no internal notion of
directories in S3, only objects, so I suspect I'm good if I only set the
bucket name?

On 2017-08-08 09:55, Joe Skora wrote:

> Laurens, 
> 
> The S3 User Guide Working with Folders [1] page explains how S3 provides a 
> conceptual directory hierarchy using key name prefixes but that buckets 
> really just hold a flat collection of objects. 
> 
> ListS3 queries S3 for the list of objects and then uses the object
> timestamp, as James pointed out, to determine what's new to be processed.
> (Though it uses the last modified timestamp, not the last read timestamp.) You
> can populate the "Prefix" property of the processor so that S3 can filter
> the object list (as if for a directory tree) before sending the list back to 
> NiFi to make things more efficient when dealing with subsets of the bucket 
> contents. 
> 
> Regards, 
> Joe S 
> 
> On Tue, Aug 8, 2017 at 11:22 AM, Laurens Vets wrote:
> 
>> Hi list,
>> 
>> Does the ListS3 processor keep state of multiple directories in a bucket?
>> 
>> For instance, suppose I have a directory "logs" with subdirectories "host1", 
>> "host2" & "host3". Each directory contains logfiles which are added daily.
>> 
>> Will ListS3 keep state correctly here for all 3 subdirectories?
>> 
>> Thanks in advance.
>> 
>> Laurens

 

Links:
--
[1] http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html

Re: ListS3 question

2017-08-08 Thread Joe Skora
Laurens,

The S3 User Guide "Working with Folders" page
explains how S3 provides a conceptual directory hierarchy using key name
prefixes but that buckets really just hold a flat collection of objects.

ListS3 queries S3 for the list of objects and then uses the object
timestamp, as James pointed out, to determine what's new to be processed.
(Though it uses the last modified timestamp, not the last read timestamp.)
You can populate the "Prefix" property of the processor so that S3 can
filter the object list (as if for a directory tree) before sending the list
back to NiFi to make things more efficient when dealing with subsets of the
bucket contents.
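
As an illustration of the flat key space and prefix filtering, here is a small sketch in plain Python. The keys and the `filter_by_prefix` helper are hypothetical, and the filtering that S3 does on the server side is only simulated here with `str.startswith`.

```python
# Hypothetical flat collection of object keys in one bucket.
# The "/" characters are just part of each key; there are no real directories.
keys = [
    "logs/host1/2017-08-08.log",
    "logs/host2/2017-08-08.log",
    "logs/host3/2017-08-08.log",
    "other/readme.txt",
]


def filter_by_prefix(keys, prefix=""):
    """Simulate a prefix filter: keep keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


# A prefix of "logs/" covers all three host "subdirectories" at once.
all_logs = filter_by_prefix(keys, "logs/")

# A narrower prefix restricts the listing to one host.
host2_only = filter_by_prefix(keys, "logs/host2/")

# An empty prefix corresponds to listing the whole bucket.
whole_bucket = filter_by_prefix(keys)
```

Under these assumptions, a single prefix of `logs/` is enough to cover `host1`, `host2` and `host3`, and leaving the prefix empty simply lists every key in the bucket.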

Regards,
Joe S


On Tue, Aug 8, 2017 at 11:22 AM, Laurens Vets wrote:

> Hi list,
>
> Does the ListS3 processor keep state of multiple directories in a bucket?
>
> For instance, suppose I have a directory "logs" with subdirectories
> "host1", "host2" & "host3". Each directory contains logfiles which are
> added daily.
>
> Will ListS3 keep state correctly here for all 3 subdirectories?
>
> Thanks in advance.
>
> Laurens
>


Re: ListS3 question

2017-08-08 Thread James Wing
Laurens,

ListS3 tracks S3 object keys within your bucket+prefix.  ListS3 primarily
works on a last read timestamp, but tracks multiple keys when the
timestamps are equal.  Directories are something of a hazy concept in S3.
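
A rough sketch of what "a timestamp plus the keys that share it" could look like, in plain Python. This is not the processor's actual code: the `list_new` helper and the sample keys are hypothetical, timestamps are plain integers, and the timestamp stands for the object's last modified time, as clarified elsewhere in this thread.

```python
def list_new(listing, state):
    """Return (new_keys, new_state).

    listing: iterable of (key, last_modified) pairs.
    state:   (latest_timestamp, frozenset of keys already seen at that timestamp).
    """
    latest_ts, seen_keys = state
    fresh = [
        (key, ts)
        for key, ts in listing
        if latest_ts is None
        or ts > latest_ts
        or (ts == latest_ts and key not in seen_keys)
    ]
    if not fresh:
        return [], state
    new_ts = max(ts for _, ts in fresh)
    keys_at_new_ts = {key for key, ts in fresh if ts == new_ts}
    if new_ts == latest_ts:
        # Same timestamp as before: extend the set of keys already seen.
        keys_at_new_ts |= seen_keys
    return sorted(key for key, _ in fresh), (new_ts, frozenset(keys_at_new_ts))


state = (None, frozenset())
listing = [("logs/host1/a.log", 100), ("logs/host2/b.log", 100)]
first, state = list_new(listing, state)

# A third object appears with the same timestamp as the ones already listed.
listing.append(("logs/host3/c.log", 100))
second, state = list_new(listing, state)

# Nothing changed since the last run, so nothing is listed again.
third, state = list_new(listing, state)
```

Tracking the keys at the latest timestamp is what lets the second run pick up `logs/host3/c.log` without listing the first two objects again.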

Thanks,

James

On Tue, Aug 8, 2017 at 8:22 AM, Laurens Vets wrote:

> Hi list,
>
> Does the ListS3 processor keep state of multiple directories in a bucket?
>
> For instance, suppose I have a directory "logs" with subdirectories
> "host1", "host2" & "host3". Each directory contains logfiles which are
> added daily.
>
> Will ListS3 keep state correctly here for all 3 subdirectories?
>
> Thanks in advance.
>
> Laurens
>


ListS3 question

2017-08-08 Thread Laurens Vets

Hi list,

Does the ListS3 processor keep state of multiple directories in a 
bucket?


For instance, suppose I have a directory "logs" with subdirectories 
"host1", "host2" & "host3". Each directory contains logfiles which are 
added daily.


Will ListS3 keep state correctly here for all 3 subdirectories?

Thanks in advance.

Laurens