> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will have
> different columns with different data types. So would I create a specific
> CSV reader inherited from the general one? Also I'm assuming this would
> need to be in Scala/Java? (I suck at both of those :)
This is a good question. What I have seen others do is actually run
different streams for the different log types. This way you can customize
the schema to the specific log type.
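One stream per log type basically means one schema definition per log type. As a plain-Python sketch of the idea (these column names and functions are illustrative, not the Spark API or actual Bro schemas):

```python
# Hypothetical per-type schemas: column name -> Python type used to parse it.
SCHEMAS = {
    "http": {"ts": float, "host": str, "status_code": int},
    "dns":  {"ts": float, "query": str, "rcode": int},
}


def parse_typed(log_type, values):
    """Apply the schema for `log_type` to a list of raw string values."""
    schema = SCHEMAS[log_type]
    # Pair each (column, caster) with the corresponding raw value.
    return {col: cast(v) for (col, cast), v in zip(schema.items(), values)}
```

In Spark proper, each entry in that dict would become the schema you pass to a separate stream reading only that log type's files.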
Even without using Scala/Java, you could use the text data source
(assuming the logs are newline-delimited) and then write the parser for
each line in Python. There will be a performance penalty here, though.
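As a minimal sketch of such a per-line parser, assuming a standard Bro-style header where a `#fields` line names the tab-separated columns (the function names here are mine, not from any library):

```python
def parse_fields(header_lines):
    """Pull the column names out of a Bro-style header.

    The header contains a line like:
        #fields\tts\tuid\tid.orig_h\t...
    """
    for line in header_lines:
        if line.startswith("#fields"):
            # Everything after the "#fields" token is a column name.
            return line.rstrip("\n").split("\t")[1:]
    raise ValueError("no #fields line found in header")


def parse_record(fields, line):
    """Turn one tab-separated log line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(fields, values))
```

Each parsed dict could then be fed to the stream as a typed row; type casting (string to float/int per column) would go in `parse_record` once you know each column's type.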
> 2) Dynamic Tailing: Do the CSV/TSV data sources support dynamic tailing
> and handle log rotations?
The file based sources work by tracking which files have been processed and
then scanning (optionally using glob patterns) for new files. There are two
assumptions here: files are immutable when they arrive, and files always
have a unique name. If files are deleted, we ignore that, so you are okay
to rotate them out.
The full pipeline that I have seen often involves the logs getting uploaded
to something like S3. This is nice because you get atomic visibility of
files that have already been rotated. So I wouldn't really call this
dynamic tailing, but we do support looking for new files at some interval.