Have you already looked at FileSplitter/BlockReader?

https://apex.apache.org/docs/malhar/operators/file_splitter/

Would that better support your customization requirements?
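
Whichever operator you end up with, the magic-byte check from your list can live in a small helper that is independent of Apex. Below is a minimal sketch in plain Java, assuming gzip, zip and bzip2 are among the formats you care about. The class name MagicSniffer and the Kind enum are made up for illustration and are not part of Malhar. A tar check would need more than four bytes, since the "ustar" marker sits at offset 257.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Apex Malhar: classifies a file by its
// leading "magic" bytes rather than by its extension.
public class MagicSniffer {

    public enum Kind { GZIP, ZIP, BZIP2, UNKNOWN }

    // Classify from the first bytes of a file; len is how many bytes of head are valid.
    public static Kind detect(byte[] head, int len) {
        // gzip: 1F 8B
        if (len >= 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B) {
            return Kind.GZIP;
        }
        // zip local file header: 'P' 'K' 03 04
        if (len >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return Kind.ZIP;
        }
        // bzip2: 'B' 'Z' 'h'
        if (len >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return Kind.BZIP2;
        }
        return Kind.UNKNOWN;
    }

    // Peek at the first four bytes without consuming them. The stream must
    // support mark/reset (e.g. wrap it in a BufferedInputStream first).
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[4];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // file shorter than four bytes
            }
            n += r;
        }
        in.reset();
        return detect(head, n);
    }
}
```

Because sniff() resets the stream, the same stream can then be handed to the matching decompressor, or read as-is for the "raw" case.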



--
sent from mobile


On Thu, Jun 21, 2018, 9:29 PM Aaron Bossert <aa...@punchcyber.com> wrote:

> Folks,
>
> I have been working with
> com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a
> slightly different use case, but I keep running into inefficiencies. Before
> detailing my use case: I do have this working, but I feel like it is
> horribly inefficient and definitely far from elegant.
>
>
>    - Scan multiple directories (not one as is expected)
>    - Accept changes to directories to be scanned on the fly
>    - Accept multiple file types (based on checking magic bytes/number)
>    - Assume that files may be in any of the following conditions:
>       - "Raw"
>       - Compressed
>       - Archived
>       - Compressed and Archived
>    - Associate provenance (e.g. customer and sensor) with events
>    extracted from these files
>
> My existing solution was to provide my own implementation of
> AbstractFileInputOperator.DirectoryScanner, and also to spit out
> arrays/lists of events rather than Strings (lines from each file) due to
> the binary nature of most of my input file types.
>
> I am seeing several mismatches between my use case and the
> AbstractFileInputOperator, but I also see a ton of existing work within it
> that I would prefer not to redo (partitioning, fault tolerance, etc.). Is
> there a more appropriate class or interface I should be looking at? Or is
> it appropriate to create a new interface for a directory scanner that
> accounts for multiple directories and for compressed and archived files?
> (Things like openFile would then need to return a list of InputStreams, at
> a minimum, to accommodate these files.) I just want to make sure I am not
> overdoing things in a quest for more efficient and cleaner code.
>
> --
>
> M. Aaron Bossert
> (571) 242-4021
> Punch Cyber Analytics Group
>
>
>
