Yes, definitely use Avro instead of JSON if you can. HIVE-895 added the Avro SerDe to Hive, so external tables can sit directly on top of the Avro container files Flume writes. Pretty much the entire Hadoop ecosystem supports Avro at this point, and the ability to evolve/version the schema is one of its main benefits.
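A minimal sketch of what that external table could look like, assuming the Flume files land under /flume/events and the event schema is published somewhere reachable (the table name, path, and schema URL below are placeholders):

  -- columns are derived from the Avro schema, so no column list is needed
  CREATE EXTERNAL TABLE flume_events
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/flume/events'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/flume_event.avsc');

When the schema evolves, you update the .avsc file the table points at rather than rewriting data, which is where the schema-evolution benefit shows up on the Hive side.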
On Fri, Nov 1, 2013 at 9:50 AM, Jeremy Karlson <[email protected]> wrote:

> Hi Jeff,
>
> Thanks for your suggestions. My only Flume experience so far is with the
> Elasticsearch sink, which serializes (headers and body) to JSON
> automatically. I was expecting something similar from the HDFS sink, and
> when it didn't do that I started questioning the file format when I should
> have been looking at the serializer. A misunderstanding on my part.
>
> I just finished serializing to JSON when I saw you suggested Avro. I'll
> look into that. Is that what you would use if you were going to query with
> Hive external tables?
>
> Thanks again!
>
> -- Jeremy
>
>
> On Thu, Oct 31, 2013 at 4:42 PM, Jeff Lord <[email protected]> wrote:
>
>> Jeremy,
>>
>> The DataStream fileType will let you write text files.
>> CompressedStream will do just that: a compressed stream.
>> SequenceFile will create sequence files, as you have guessed, and you can
>> use either Text or Writable (bytes) for your data there.
>>
>> Flume is configurable out of the box with regard to the size of your
>> files. Yes, you are correct that it is better to create files that are at
>> least the size of a full block.
>> You can roll your files based on time, size, or number of events. Rolling
>> on an hourly basis makes perfect sense.
>>
>> With all that said, we recommend writing to Avro container files, as that
>> format is the best suited for use across the Hadoop ecosystem.
>> Avro has many benefits, including support for compression, code
>> generation, versioning, and schema evolution.
>> You can do this with Flume by specifying the avro_event type for the
>> serializer property in your HDFS sink.
>>
>> Hope this helps.
>>
>> -Jeff
>>
>>
>> On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[email protected]> wrote:
>>
>>> Hi everyone.
>>>
>>> I'm trying to set up Flume to log into HDFS. Along the way, Flume
>>> attaches a number of headers (environment, hostname, etc.) that I would
>>> also like to store with my log messages. Ideally, I'd like to be able to
>>> use Hive to query all of this later. I must also admit to knowing next to
>>> nothing about HDFS. That probably doesn't help. :-P
>>>
>>> I'm confused about the HDFS sink configuration. Specifically, I'm
>>> trying to understand what these two options do (and how they interact):
>>>
>>> hdfs.fileType
>>> hdfs.writeFormat
>>>
>>> File Type:
>>>
>>> DataStream - This appears to write the event body and lose all
>>> headers. Correct?
>>> CompressedStream - I assume just a compressed data stream.
>>> SequenceFile - I think this is what I want, since it seems to be a
>>> key/value based thing, which I assume means it will include headers.
>>>
>>> Write Format: This seems to only apply to SequenceFile above, but lots
>>> of Internet examples seem to state otherwise. I'm also unclear on the
>>> difference here. Isn't "Text" just a specific type of "Writable" in HDFS?
>>>
>>> Also, I'm unclear on why Flume, by default, seems to be set up to make
>>> such small HDFS files. Isn't HDFS designed for (and more efficient at)
>>> storing larger files that are closer to the size of a full block? I was
>>> thinking it made more sense to write all log data to a single file and
>>> roll that file hourly (or whatever, depending on volume). Thoughts here?
>>>
>>> Thanks a lot.
>>>
>>> -- Jeremy
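A minimal sketch of the HDFS sink settings Jeff describes above, as they might appear in the agent's properties file (the agent, sink, and channel names and the HDFS path are placeholders, not anything from this thread):

  # write Avro container files via the built-in avro_event serializer;
  # unlike the default TEXT serializer, avro_event keeps both headers and body
  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.channel = c1
  agent1.sinks.k1.hdfs.path = /flume/events
  agent1.sinks.k1.hdfs.fileType = DataStream
  agent1.sinks.k1.serializer = avro_event
  # roll hourly or at ~128 MB instead of the small defaults; 0 disables event-count rolling
  agent1.sinks.k1.hdfs.rollInterval = 3600
  agent1.sinks.k1.hdfs.rollSize = 134217728
  agent1.sinks.k1.hdfs.rollCount = 0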
