Hi Jeff,

Thanks for your suggestions. My only Flume experience so far is with the Elasticsearch sink, which serializes headers and body to JSON automatically. I was expecting something similar from the HDFS sink, and when it didn't do that, I started questioning the file format when I should have been looking at the serializer. A misunderstanding on my part.
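(For anyone who hits the same confusion later: as I now understand it, the format knobs sit side by side on the HDFS sink, roughly like the sketch below. The agent/sink/channel names and the path here are made up, not from my actual config.)

# "a1", "k1", and "c1" are placeholder names for this sketch
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# container format of the files written to HDFS
a1.sinks.k1.hdfs.fileType = DataStream
# writeFormat is only consulted for SequenceFile records (Text or Writable)
a1.sinks.k1.hdfs.writeFormat = Text
# the serializer is what decides how headers + body are laid out in the file
a1.sinks.k1.serializer = avro_event

So fileType/writeFormat pick the container on HDFS, while the serializer decides what actually goes into it.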
I just finished serializing to JSON when I saw you suggested Avro. I'll look into that. Is that what you would use if you were going to query with Hive external tables?

Thanks again!

-- Jeremy


On Thu, Oct 31, 2013 at 4:42 PM, Jeff Lord <[email protected]> wrote:

> Jeremy,
>
> DataStream fileType will let you write text files.
> CompressedStream will do just that.
> SequenceFile will create sequence files, as you have guessed, and you can
> use either Text or Writable (bytes) for your data here.
>
> So Flume is configurable out of the box with regard to the size of your
> files. Yes, you are correct that it is better to create files that are at
> least the size of a full block.
> You can roll your files based on time, size, or number of events. Rolling
> on an hourly basis makes perfect sense.
>
> With all that said, we recommend writing to Avro container files, as that
> format is best suited for use in the Hadoop ecosystem.
> Avro has many benefits, including support for compression, code
> generation, versioning, and schema evolution.
> You can do this with Flume by specifying the avro_event type for the
> serializer property in your HDFS sink.
>
> Hope this helps.
>
> -Jeff
>
>
> On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[email protected]> wrote:
>
>> Hi everyone.
>>
>> I'm trying to set up Flume to log into HDFS. Along the way, Flume
>> attaches a number of headers (environment, hostname, etc.) that I would
>> also like to store with my log messages. Ideally, I'd like to be able to
>> use Hive to query all of this later. I must also admit to knowing next to
>> nothing about HDFS. That probably doesn't help. :-P
>>
>> I'm confused about the HDFS sink configuration. Specifically, I'm trying
>> to understand what these two options do (and how they interact):
>>
>> hdfs.fileType
>> hdfs.writeFormat
>>
>> File Type:
>>
>> DataStream - This appears to write the event body and loses all headers.
>> Correct?
>> CompressedStream - I assume just a compressed data stream.
>> SequenceFile - I think this is what I want, since it seems to be a
>> key/value based thing, which I assume means it will include headers.
>>
>> Write Format: This seems to only apply to SequenceFile above, but lots
>> of Internet examples seem to state otherwise. I'm also unclear on the
>> difference here. Isn't "Text" just a specific type of "Writable" in HDFS?
>>
>> Also, I'm unclear on why Flume, by default, seems to be set up to make
>> such small HDFS files. Isn't HDFS designed for (and more efficient when)
>> storing larger files that are closer to the size of a full block? I was
>> thinking it made more sense to write all log data to a single file and
>> roll that file hourly (or whatever, depending on volume). Thoughts here?
>>
>> Thanks a lot.
>>
>> -- Jeremy
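P.S. If I follow Jeff's rolling advice correctly, an hourly roll with the size- and count-based triggers disabled would map to roughly these properties on the same placeholder sink (rollInterval is in seconds, and 0 disables a trigger):

# roll a new file every hour; don't roll on size or event count
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0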
