Hi everyone. I'm trying to set up Flume to log to HDFS. Along the way, Flume attaches a number of headers (environment, hostname, etc.) that I would also like to store with my log messages. Ideally, I'd like to be able to use Hive to query all of this later. I must also admit to knowing next to nothing about HDFS. That probably doesn't help. :-P
I'm confused about the HDFS sink configuration. Specifically, I'm trying to understand what these two options do (and how they interact):

  hdfs.fileType
  hdfs.writeFormat

File Type:

  DataStream - This appears to write only the event body, and loses all headers. Correct?
  CompressedStream - I assume this is just a compressed data stream.
  SequenceFile - I think this is what I want, since it seems to be a key/value-based format, which I assume means it will include the headers.

Write Format:

  This seems to apply only to SequenceFile above, but lots of Internet examples seem to state otherwise. I'm also unclear on the difference between the two values. Isn't "Text" just a specific kind of "Writable" in Hadoop?

Also, I'm unclear on why Flume, by default, seems to be set up to write such small HDFS files. Isn't HDFS designed to be more efficient at storing larger files, closer to the size of a full block? I was thinking it made more sense to write all log data to a single file and roll that file hourly (or whatever makes sense for the volume). Thoughts here?

Thanks a lot.

-- Jeremy
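P.S. For concreteness, here's a sketch of the sink config I'm describing. The agent, sink, and channel names (a1, k1, c1) and the path are placeholders from my experiments, not anything canonical; the hdfs.* keys are the ones I'm asking about:

```
# Hypothetical agent/sink names; only the hdfs.* keys matter here.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d

# The two options I'm confused about:
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.writeFormat = Writable

# My guess at avoiding lots of small files: roll on time only
# (an hour here), and disable the size- and count-based rolling.
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```

(If I understand correctly, the date escapes in hdfs.path need a timestamp header on each event, e.g. from a timestamp interceptor, but that's a separate question.)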
