Folks, I have been working with com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a slightly different use case, but I keep running into inefficiencies. Before detailing my use case, I should say up front that I do have this working, but it feels horribly inefficient and far from elegant.
My use case requires the following:

- Scan multiple directories (not one, as the operator expects)
- Accept changes to the directories to be scanned on the fly
- Accept multiple file types (based on checking magic bytes/numbers)
- Assume that files may be in any of the following conditions:
  - "Raw"
  - Compressed
  - Archived
  - Compressed and archived
- Associate provenance (e.g. customer and sensor) with events extracted from these files

My existing solution was to provide my own implementation of AbstractFileInputOperator.DirectoryScanner, and to emit arrays/lists of events rather than Strings (lines from each file), because most of my input file types are binary.

I am seeing several mismatches between my use case and AbstractFileInputOperator, but I also see a ton of existing work within it that I would prefer not to redo (partitioning, fault tolerance, etc.).

Is there a more appropriate class/interface I should be looking at? Or is it appropriate to create a new interface for a directory scanner that accounts for multiple directories and for compressed and archived files? (Things like openFile would then need to return a list of input streams, at a minimum, to accommodate these files.) I just want to make sure I am not overdoing things in a quest for more efficient and clean code.

--
M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group
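
To make the magic-bytes requirement above concrete, here is a minimal sketch of file-type sniffing in plain Java. It is illustrative only and is not the actual implementation: the class name MagicSniffer, the Kind enum, and the methods detect and sniff are hypothetical, and nothing here uses the Malhar API. It assumes the well-known signatures for gzip (1F 8B), bzip2 ("BZh"), zip ("PK" 03 04), and POSIX tar ("ustar" at offset 257).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper: classifies a stream by its leading magic bytes.
final class MagicSniffer {
    enum Kind { GZIP, BZIP2, ZIP, TAR, UNKNOWN }

    // The tar "ustar" marker sits at offset 257, so 262 bytes cover every check.
    static final int HEADER_LEN = 262;

    // Classify a header buffer; anything unrecognized is treated as UNKNOWN ("raw").
    static Kind detect(byte[] h) {
        if (startsWith(h, 0, 0x1F, 0x8B)) return Kind.GZIP;
        if (startsWith(h, 0, 'B', 'Z', 'h')) return Kind.BZIP2;
        if (startsWith(h, 0, 'P', 'K', 0x03, 0x04)) return Kind.ZIP;
        if (startsWith(h, 257, 'u', 's', 't', 'a', 'r')) return Kind.TAR;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] h, int offset, int... magic) {
        if (h.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((h[offset + i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    // Peek at the header without consuming it, so the caller can still
    // hand the untouched stream to a decompressor or parser afterwards.
    static Kind sniff(BufferedInputStream in) throws IOException {
        in.mark(HEADER_LEN);
        byte[] header = new byte[HEADER_LEN];
        int n = 0;
        while (n < HEADER_LEN) {
            int r = in.read(header, n, HEADER_LEN - n);
            if (r < 0) break; // short file: classify whatever was read
            n += r;
        }
        in.reset();
        return detect(Arrays.copyOf(header, n));
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = {(byte) 0x1F, (byte) 0x8B, 8, 0};
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(gz));
        System.out.println(sniff(in)); // prints GZIP
    }
}
```

For the "compressed and archived" case (e.g. a gzipped tar), one option under these assumptions is to sniff once, wrap the stream in the matching decompressor (such as java.util.zip.GZIPInputStream), wrap that in a new BufferedInputStream, and sniff again.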