Thomas,

You know, I did read through the source for that, but I guess my imagination didn't kick in. Maybe I got stuck on the name being a file splitter, as opposed to its potential to be repurposed for archived files. Thanks for the pointer. I will give that a go before reinventing the wheel.
On Fri, Jun 22, 2018 at 2:08 AM Thomas Weise <[email protected]> wrote:

> Did you already look at FileSplitter/BlockReader?
>
> https://apex.apache.org/docs/malhar/operators/file_splitter/
>
> Would that better support your customization requirements?
>
> --
> sent from mobile
>
> On Thu, Jun 21, 2018, 9:29 PM Aaron Bossert <[email protected]> wrote:
>
>> folks,
>>
>> I have been working with
>> com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a
>> slightly different use case, but keep running into inefficiencies. Prior
>> to detailing my use case: up front, I do have this working, but I feel like
>> it is horribly inefficient and definitely far from elegant.
>>
>>    - Scan multiple directories (not one as is expected)
>>    - Accept changes to directories to be scanned on the fly
>>    - Accept multiple file types (based on checking magic bytes/number)
>>    - Assume that files may be in any of the following conditions:
>>       - "Raw"
>>       - Compressed
>>       - Archived
>>       - Compressed and Archived
>>    - Associate provenance (e.g. customer and sensor) with events
>>    extracted from these files
>>
>> My existing solution was to provide my own implementation of
>> AbstractFileInputOperator.DirectoryScanner, and also to spit out
>> arrays/lists of events rather than Strings (lines from each file) due to
>> the binary nature of most of my input file types.
>>
>> I am seeing several mismatches between my use case and the
>> AbstractFileInputOperator, but also see a ton of existing work within it
>> that I would prefer not to redo (partitioning, fault-tolerance, etc.). Is
>> there a more appropriate class/interface I should be looking at, or is it
>> appropriate to create a new interface to handle a directory scanner that
>> accounts for multiple directories and the potential to deal with compressed
>> and archived files (thus things like openFile would need to support
>> outputting a list of InputStreams at a minimum to accommodate these
>> files)? I just want to make sure I am not overdoing things in a quest for
>> more efficient and clean code...
>>
>> --
>>
>> M. Aaron Bossert
>> (571) 242-4021
>> Punch Cyber Analytics Group

--
M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group
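
For reference, the "checking magic bytes/number" step from the quoted list could look roughly like the sketch below. It is only an illustration using the plain JDK: the class name MagicSniffer, the Kind enum, and the set of formats checked (gzip, bzip2, zip, tar) are assumptions made for this example and are not part of Apex Malhar or of the code discussed in this thread.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Malhar class): classifies a file by its leading bytes. */
final class MagicSniffer {

    /** Formats this sketch recognizes; RAW means "no known signature matched". */
    enum Kind { GZIP, BZIP2, ZIP, TAR, RAW }

    // A POSIX (ustar) tar header stores the ASCII string "ustar" at byte offset 257.
    private static final int TAR_MAGIC_OFFSET = 257;
    private static final byte[] TAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    private MagicSniffer() {}

    /** Returns true if data contains magic starting at offset. */
    private static boolean hasMagic(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Classifies the first bytes of a file. Pass at least 262 bytes
     * so that the tar signature can be checked as well.
     */
    static Kind detect(byte[] header) {
        if (hasMagic(header, 0, new byte[] {(byte) 0x1f, (byte) 0x8b})) {
            return Kind.GZIP;
        }
        if (hasMagic(header, 0, new byte[] {'B', 'Z', 'h'})) {
            return Kind.BZIP2;
        }
        if (hasMagic(header, 0, new byte[] {'P', 'K', 3, 4})) {
            return Kind.ZIP;
        }
        if (hasMagic(header, TAR_MAGIC_OFFSET, TAR_MAGIC)) {
            return Kind.TAR;
        }
        return Kind.RAW;
    }
}
```

For the "Compressed and Archived" case (e.g. a gzipped tar), the same check would presumably be applied twice: once on the file itself, and again on the first bytes of the decompressed stream.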
