Did you already look at FileSplitter/BlockReader? https://apex.apache.org/docs/malhar/operators/file_splitter/
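If it helps, below is roughly how I would picture wiring those two together. This is only an untested sketch based on the linked doc; in particular, treat the scanner's comma-separated "files" property, the FSSliceReader class, and the port names as assumptions to verify against your Malhar version.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.lib.io.block.FSSliceReader;
import com.datatorrent.lib.io.fs.FileSplitterInput;

public class MultiDirReadApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    FileSplitterInput splitter = dag.addOperator("Splitter", new FileSplitterInput());
    // The time-based scanner takes a comma-separated list of directories;
    // it can also be configured via dt.operator.Splitter.prop.scanner.files.
    splitter.getScanner().setFiles("/data/in/customerA,/data/in/customerB");

    // FSSliceReader reads the blocks the splitter hands out and emits byte slices,
    // so downstream parsing is not tied to "one String line per tuple".
    FSSliceReader reader = dag.addOperator("Reader", new FSSliceReader());

    dag.addStream("blocks", splitter.blocksMetadataOutput, reader.blocksMetadataInput);
    // reader.messages then feeds whatever decompression/parsing operator you need.
  }
}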
Would that better support your customization requirements?

-- sent from mobile

On Thu, Jun 21, 2018, 9:29 PM Aaron Bossert <aa...@punchcyber.com> wrote:

> folks,
>
> I have been working with
> com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a
> slightly different use case, but keep running into inefficiencies. Before
> detailing my use case, up front: I do have this working, but I feel like
> it is horribly inefficient and definitely far from elegant.
>
> - Scan multiple directories (not one, as is expected)
> - Accept changes to the set of scanned directories on the fly
> - Accept multiple file types (detected by checking magic bytes/numbers)
> - Assume that files may be in any of the following conditions:
>   - "Raw"
>   - Compressed
>   - Archived
>   - Compressed and Archived
> - Associate provenance (e.g. customer and sensor) with events extracted
>   from these files
>
> My existing solution was to provide my own implementation of
> AbstractFileInputOperator.DirectoryScanner, and also to emit arrays/lists
> of events rather than Strings (lines from each file) due to the binary
> nature of most of my input file types.
>
> I am seeing several mismatches between my use case and the
> AbstractFileInputOperator, but I also see a ton of existing work within
> it that I would prefer not to redo (partitioning, fault tolerance, etc.).
> Is there a more appropriate class/interface I should be looking at, or is
> it appropriate to create a new interface for a directory scanner that
> handles multiple directories and can deal with compressed and archived
> files (meaning that things like openFile would need to output a list of
> InputStreams, at a minimum, to accommodate these files)? I just want to
> make sure I am not overdoing things in a quest for more efficient and
> clean code.
>
> --
>
> M. Aaron Bossert
> (571) 242-4021
> Punch Cyber Analytics Group
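One more thought on the compressed/archived piece: rather than changing openFile to return a list of streams, it may be enough to wrap the single stream it already returns. Below is a rough, untested sketch of that idea, assuming openFile(Path) is the protected hook you mentioned and that it returns a plain InputStream. It only handles gzip; multi-entry archives (zip/tar) are the part that really does not fit the one-stream-per-file contract and would probably still push you toward FileSplitter or a custom operator.

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.fs.Path;

import com.datatorrent.lib.io.fs.AbstractFileInputOperator;

// Sketch only: decorates the stream opened by the parent class based on magic bytes.
public abstract class MagicAwareFileInputOperator<T> extends AbstractFileInputOperator<T>
{
  @Override
  protected InputStream openFile(Path path) throws IOException
  {
    // Peek at the first two bytes without consuming them.
    PushbackInputStream peek = new PushbackInputStream(super.openFile(path), 2);
    byte[] magic = new byte[2];
    int read = peek.read(magic);
    if (read > 0) {
      peek.unread(magic, 0, read);
    }
    // gzip files start with 0x1f 0x8b; anything else passes through untouched.
    if (read == 2 && (magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b) {
      return new GZIPInputStream(peek);
    }
    return peek;
  }
}

With something like this, the read/emit side of the operator only ever sees decompressed bytes, so the partitioning and fault-tolerance machinery you want to keep stays untouched.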