Yes, definitely use Avro instead of JSON if you can. HIVE-895 added the Avro SerDe to Hive, so external tables can sit directly on top of the Avro container files Flume writes. Pretty much the entire Hadoop ecosystem supports Avro at this point, and the ability to evolve/version the schema is one of its main benefits.
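A minimal sketch of what that external table could look like, assuming the Flume files land under /flume/events and the event schema is published somewhere reachable (the table name, path, and schema URL below are placeholders):

  -- columns are derived from the Avro schema, so no column list is needed
  CREATE EXTERNAL TABLE flume_events
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/flume/events'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/flume_event.avsc');

When the schema evolves, you update the .avsc file the table points at rather than rewriting data, which is where the schema-evolution benefit shows up on the Hive side.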
On Fri, Nov 1, 2013 at 9:50 AM, Jeremy Karlson <[email protected]> wrote:

> Hi Jeff,
>
> Thanks for your suggestions. My only Flume experience so far is with the
> Elasticsearch sink, which serializes (headers and body) to JSON
> automatically. I was expecting something similar from the HDFS sink, and
> when it didn't do that I started questioning the file format when I should
> have been looking at the serializer. A misunderstanding on my part.
>
> I just finished serializing to JSON when I saw you suggested Avro. I'll
> look into that. Is that what you would use if you were going to query with
> Hive external tables?
>
> Thanks again!
>
> -- Jeremy
>
>
> On Thu, Oct 31, 2013 at 4:42 PM, Jeff Lord <[email protected]> wrote:
>
>> Jeremy,
>>
>> The DataStream fileType will let you write text files.
>> CompressedStream will do just that: a compressed stream.
>> SequenceFile will create sequence files, as you have guessed, and you can
>> use either Text or Writable (bytes) for your data there.
>>
>> Flume is configurable out of the box with regard to the size of your
>> files. Yes, you are correct that it is better to create files that are at
>> least the size of a full block.
>> You can roll your files based on time, size, or number of events. Rolling
>> on an hourly basis makes perfect sense.
>>
>> With all that said, we recommend writing to Avro container files, as that
>> format is the best suited for use across the Hadoop ecosystem.
>> Avro has many benefits, including support for compression, code
>> generation, versioning, and schema evolution.
>> You can do this with Flume by specifying the avro_event type for the
>> serializer property in your HDFS sink.
>>
>> Hope this helps.
>>
>> -Jeff
>>
>>
>> On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[email protected]> wrote:
>>
>>> Hi everyone.
>>>
>>> I'm trying to set up Flume to log into HDFS. Along the way, Flume
>>> attaches a number of headers (environment, hostname, etc.) that I would
>>> also like to store with my log messages. Ideally, I'd like to be able to
>>> use Hive to query all of this later. I must also admit to knowing next to
>>> nothing about HDFS. That probably doesn't help. :-P
>>>
>>> I'm confused about the HDFS sink configuration. Specifically, I'm
>>> trying to understand what these two options do (and how they interact):
>>>
>>> hdfs.fileType
>>> hdfs.writeFormat
>>>
>>> File Type:
>>>
>>> DataStream - This appears to write the event body and lose all
>>> headers. Correct?
>>> CompressedStream - I assume just a compressed data stream.
>>> SequenceFile - I think this is what I want, since it seems to be a
>>> key/value based thing, which I assume means it will include headers.
>>>
>>> Write Format: This seems to only apply to SequenceFile above, but lots
>>> of Internet examples seem to state otherwise. I'm also unclear on the
>>> difference here. Isn't "Text" just a specific type of "Writable" in HDFS?
>>>
>>> Also, I'm unclear on why Flume, by default, seems to be set up to make
>>> such small HDFS files. Isn't HDFS designed for (and more efficient at)
>>> storing larger files that are closer to the size of a full block? I was
>>> thinking it made more sense to write all log data to a single file and
>>> roll that file hourly (or whatever, depending on volume). Thoughts here?
>>>
>>> Thanks a lot.
>>>
>>> -- Jeremy
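A minimal sketch of the HDFS sink settings Jeff describes above, as they might appear in the agent's properties file (the agent, sink, and channel names and the HDFS path are placeholders, not anything from this thread):

  # write Avro container files via the built-in avro_event serializer;
  # unlike the default TEXT serializer, avro_event keeps both headers and body
  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.channel = c1
  agent1.sinks.k1.hdfs.path = /flume/events
  agent1.sinks.k1.hdfs.fileType = DataStream
  agent1.sinks.k1.serializer = avro_event
  # roll hourly or at ~128 MB instead of the small defaults; 0 disables event-count rolling
  agent1.sinks.k1.hdfs.rollInterval = 3600
  agent1.sinks.k1.hdfs.rollSize = 134217728
  agent1.sinks.k1.hdfs.rollCount = 0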
