> Does Hadoop provide a pluggable feature for Serialization for both the
> above cases?

- You can override the RPC serialisation module and engine with a custom
  class if you wish to, but it would not be a trivial task.
- You can easily use custom data serialisation modules for I/O, as sketched
  below.
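A minimal sketch of plugging one in, assuming a hypothetical
com.example.MyCustomSerialization class that implements
org.apache.hadoop.io.serializer.Serialization (the Writable and Java
serializers listed are the ones that ship with Hadoop):

// Sketch: registering serialisation frameworks through the
// io.serializations property, which Hadoop consults when it needs to
// (de)serialise job keys and values.
import org.apache.hadoop.conf.Configuration;

public class SerializationConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // WritableSerialization is the default entry; JavaSerialization ships
    // with Hadoop but is rarely used; com.example.MyCustomSerialization is
    // a hypothetical implementation of
    // org.apache.hadoop.io.serializer.Serialization<T>.
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.JavaSerialization",
        "com.example.MyCustomSerialization");
  }
}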
> Is Writable the default Serialization mechanism for both the above cases?

While MR's built-in examples in Apache Hadoop continue to use Writables,
the RPCs have moved to using Protocol Buffers for themselves from 2.x
onwards.

> Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x?

Yes, "partially"; see above.

> Will there be a significant performance gain if the default Serialization,
> i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map
> Reduce programming?

Yes, you should see a gain in using a more efficient data serialisation
format for data files.
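As a rough sketch of what that switch looks like on the data-file side,
using the avro-mapred module (the class names and the string schema below
are illustrative, not prescribed anywhere in this thread):

// Sketch: writing a job's output as Avro data files instead of
// Writable-based SequenceFiles, via the avro-mapred module.
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class AvroOutputJobSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-output-sketch");
    job.setJarByClass(AvroOutputJobSketch.class);

    // The reducer would emit AvroKey<CharSequence> keys; the schema tells
    // Avro how to encode them compactly in the output container files.
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    job.setOutputValueClass(NullWritable.class);

    // Mapper, reducer, input format and paths would be set as usual before
    // submitting; they are omitted here.
  }
}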
On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas <[email protected]> wrote:
> Those are all great questions, and mostly difficult to answer. I haven't
> played with serialization APIs in some time, but let me try to give some
> guidance. WRT your bulleted questions above:
>
> 1) Serialization is file system independent: the use of any Hadoop
> compatible file system should support any kind of serialization.
>
> 2) See (1). The "default serialization" is Writables, but you can easily
> add your own by modifying the io.serializations configuration parameter.
>
> 3) I doubt anything significant affecting the way serialization works: the
> main thrust of 1->2 was in the way services are deployed, not changing the
> internals of how data is serialized. After all, the serialization APIs
> need to remain stable even as the architecture of Hadoop changes.
>
> 4) It depends on the implementation. If you have a custom Writable that is
> really good at compressing your data, that will be better than using a
> Thrift auto-generated API for serialization that is uncustomized out of
> the box. Example: say you are writing "strings" and you know the string is
> max 3 characters. A "smart" Writable serializer with custom
> implementations optimized for your data will beat a Thrift serialization
> approach. But I think in general, the advantage of Thrift/Avro is that
> it's easier to get really good compression natively out-of-the-box, due to
> the fact that many different data types are strongly supported by the way
> they apply the schemas (for example, a Thrift struct can contain a
> "boolean", two "strings", and an "int"). These types will all be optimized
> for you by Thrift... whereas in Writables, you cannot as easily create
> sophisticated types with optimization of nested properties.
>
>
> On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe <[email protected]>
> wrote:
>>
>> Hello All,
>>
>> AFAIK Hadoop serialization comes into the picture in 2 areas:
>>
>> - putting data on the wire, i.e. for interprocess communication between
>>   nodes using RPC
>> - putting data on disk, i.e. using Map Reduce for persistent storage,
>>   say HDFS.
>>
>> I have a couple of questions regarding the Serialization mechanisms used
>> in Hadoop:
>>
>> - Does Hadoop provide a pluggable feature for Serialization for both the
>>   above cases?
>> - Is Writable the default Serialization mechanism for both the above
>>   cases?
>> - Were there any changes w.r.t. Serialization from Hadoop 1.x to
>>   Hadoop 2.x?
>> - Will there be a significant performance gain if the default
>>   Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers
>>   or Thrift in Map Reduce programming?
>>
>> Thanks,
>> -RR
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com

--
Harsh J
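To make Jay's fourth point above concrete, a hand-tuned Writable for a
value known to be at most three characters could look like the minimal
sketch below (ShortCodeWritable is an illustrative name, not an existing
Hadoop class):

// Sketch of the kind of hand-tuned Writable described in point 4 above:
// a value known to be at most 3 ASCII characters is stored as one length
// byte plus the raw bytes, with no schema or field tags on the wire.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Writable;

public class ShortCodeWritable implements Writable {
  private String value = "";

  public void set(String value) {
    if (value.length() > 3) {
      throw new IllegalArgumentException("at most 3 characters expected");
    }
    this.value = value;
  }

  public String get() {
    return value;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.US_ASCII);
    out.writeByte(bytes.length);   // 1 length byte
    out.write(bytes);              // up to 3 payload bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readUnsignedByte();
    byte[] bytes = new byte[len];
    in.readFully(bytes);
    value = new String(bytes, StandardCharsets.US_ASCII);
  }
}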
