> Does Hadoop provide a pluggable feature for Serialization for both the
> above cases?

- You can override the RPC serialisation module and engine with a custom
  class if you wish to, but it would not be a trivial task.
- You can easily use custom data serialisation modules for I/O, as sketched
  below.
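A minimal sketch of plugging one in, assuming a hypothetical
com.example.MyCustomSerialization class that implements
org.apache.hadoop.io.serializer.Serialization (the Writable and Java
serializers listed are the ones that ship with Hadoop):

// Sketch: registering serialisation frameworks through the
// io.serializations property, which Hadoop consults when it needs to
// (de)serialise job keys and values.
import org.apache.hadoop.conf.Configuration;

public class SerializationConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // WritableSerialization is the default entry; JavaSerialization ships
    // with Hadoop but is rarely used; com.example.MyCustomSerialization is
    // a hypothetical implementation of
    // org.apache.hadoop.io.serializer.Serialization<T>.
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.JavaSerialization",
        "com.example.MyCustomSerialization");
  }
}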
> Is Writable the default Serialization mechanism for both the above cases?

While MR's built-in examples in Apache Hadoop continue to use Writables,
the RPCs have moved to using Protocol Buffers for themselves from 2.x
onwards.

> Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x?

Yes, "partially"; see above.

> Will there be a significant performance gain if the default Serialization,
> i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map
> Reduce programming?

Yes, you should see a gain in using a more efficient data serialisation
format for data files.
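As a rough sketch of what that switch looks like on the data-file side,
using the avro-mapred module (the class names and the string schema below
are illustrative, not prescribed anywhere in this thread):

// Sketch: writing a job's output as Avro data files instead of
// Writable-based SequenceFiles, via the avro-mapred module.
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class AvroOutputJobSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-output-sketch");
    job.setJarByClass(AvroOutputJobSketch.class);

    // The reducer would emit AvroKey<CharSequence> keys; the schema tells
    // Avro how to encode them compactly in the output container files.
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    job.setOutputValueClass(NullWritable.class);

    // Mapper, reducer, input format and paths would be set as usual before
    // submitting; they are omitted here.
  }
}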
On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas <[email protected]> wrote:
> Those are all great questions, and mostly difficult to answer. I haven't
> played with serialization APIs in some time, but let me try to give some
> guidance. WRT your bulleted questions above:
>
> 1) Serialization is file system independent: the use of any Hadoop
> compatible file system should support any kind of serialization.
>
> 2) See (1). The "default serialization" is Writables, but you can easily
> add your own by modifying the io.serializations configuration parameter.
>
> 3) I doubt anything significant affecting the way serialization works: the
> main thrust of 1->2 was in the way services are deployed, not changing the
> internals of how data is serialized. After all, the serialization APIs
> need to remain stable even as the architecture of Hadoop changes.
>
> 4) It depends on the implementation. If you have a custom Writable that is
> really good at compressing your data, that will be better than using a
> Thrift auto-generated API for serialization that is uncustomized out of
> the box. Example: say you are writing "strings" and you know the string is
> max 3 characters. A "smart" Writable serializer with custom
> implementations optimized for your data will beat a Thrift serialization
> approach. But I think in general, the advantage of Thrift/Avro is that
> it's easier to get really good compression natively out-of-the-box, due to
> the fact that many different data types are strongly supported by the way
> they apply the schemas (for example, a Thrift struct can contain a
> "boolean", two "strings", and an "int"). These types will all be optimized
> for you by Thrift... whereas in Writables, you cannot as easily create
> sophisticated types with optimization of nested properties.
>
>
> On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe <[email protected]>
> wrote:
>>
>> Hello All,
>>
>> AFAIK Hadoop serialization comes into the picture in 2 areas:
>>
>> - putting data on the wire, i.e. for interprocess communication between
>>   nodes using RPC
>> - putting data on disk, i.e. using Map Reduce for persistent storage,
>>   say HDFS.
>>
>> I have a couple of questions regarding the Serialization mechanisms used
>> in Hadoop:
>>
>> - Does Hadoop provide a pluggable feature for Serialization for both the
>>   above cases?
>> - Is Writable the default Serialization mechanism for both the above
>>   cases?
>> - Were there any changes w.r.t. Serialization from Hadoop 1.x to
>>   Hadoop 2.x?
>> - Will there be a significant performance gain if the default
>>   Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers
>>   or Thrift in Map Reduce programming?
>>
>> Thanks,
>> -RR
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com

--
Harsh J
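To make Jay's fourth point above concrete, a hand-tuned Writable for a
value known to be at most three characters could look like the minimal
sketch below (ShortCodeWritable is an illustrative name, not an existing
Hadoop class):

// Sketch of the kind of hand-tuned Writable described in point 4 above:
// a value known to be at most 3 ASCII characters is stored as one length
// byte plus the raw bytes, with no schema or field tags on the wire.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Writable;

public class ShortCodeWritable implements Writable {
  private String value = "";

  public void set(String value) {
    if (value.length() > 3) {
      throw new IllegalArgumentException("at most 3 characters expected");
    }
    this.value = value;
  }

  public String get() {
    return value;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.US_ASCII);
    out.writeByte(bytes.length);   // 1 length byte
    out.write(bytes);              // up to 3 payload bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readUnsignedByte();
    byte[] bytes = new byte[len];
    in.readFully(bytes);
    value = new String(bytes, StandardCharsets.US_ASCII);
  }
}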
