But I believe w.r.t. "will we see performance gains when using
avro/thrift/... over writables" -- it depends on the Writable
implementation. For example, if I have a Writable serialization which
uses a bitmap to store an enum but then reads that enum back as a
string, it will look the same to the user, yet my Writable
implementation would be superior. We can obviously say that if you use
avro/thrift/pbuffers in an efficient way, then yes, you will see a
performance gain over, say, storing everything as Text Writable
objects. But clever optimizations can be done within the Writable
framework as well.
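
As a rough sketch of that enum trick (the class and enum names here are
hypothetical, just to illustrate the idea): the value travels as a
single byte, but callers read it back as a string.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class ColorWritable implements Writable {

      // A small enum -- its ordinal fits comfortably in one byte.
      public enum Color { RED, GREEN, BLUE }

      private Color color = Color.RED;

      public void set(Color c) { color = c; }

      // The user-facing view is a string...
      public String getAsString() { return color.name(); }

      // ...but only a single byte ever hits the wire.
      @Override
      public void write(DataOutput out) throws IOException {
        out.writeByte(color.ordinal());
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        color = Color.values()[in.readByte()];
      }
    }

A Text holding "GREEN" would cost a vint length plus five UTF-8 bytes;
this costs exactly one byte.
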
On Sun, Mar 30, 2014 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
> > Does Hadoop provide a pluggable feature for Serialization for both
> > the above cases?
>
> - You can override the RPC serialisation module and engine with a
> custom class if you wish to, but it would not be a trivial task.
> - You can easily use custom data serialisation modules for I/O.
>
> > Is Writable the default Serialization mechanism for both the above
> > cases?
>
> While MR's built-in examples in Apache Hadoop continue to use
> Writables, the RPCs have moved to using Protocol Buffers for
> themselves in 2.x onwards.
>
> > Were there any changes w.r.t. Serialization from Hadoop 1.x to
> > Hadoop 2.x?
>
> Yes, "partially"; see above.
>
> > Will there be a significant performance gain if the default
> > Serialization, i.e. Writables, is replaced with Avro, Protocol
> > Buffers or Thrift in Map Reduce programming?
>
> Yes, you should see a gain in using a more efficient data
> serialisation format for data files.
>
> On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas <jayunit...@gmail.com> wrote:
> > Those are all great questions, and mostly difficult to answer. I
> > haven't played with serialization APIs in some time, but let me try
> > to give some guidance. WRT your bulleted questions above:
> >
> > 1) Serialization is file system independent: the use of any
> > Hadoop-compatible file system should support any kind of
> > serialization.
> >
> > 2) See (1). The "default serialization" is Writables, but you can
> > easily add your own by modifying the io.serializations configuration
> > parameter (see the registration sketch after point 4 below).
> >
> > 3) I doubt there was anything significant affecting the way
> > serialization works: the main thrust of 1->2 was in the way services
> > are deployed, not changing the internals of how data is serialized.
> > After all, the serialization APIs need to remain stable even as the
> > architecture of Hadoop changes.
> >
> > 4) It depends on the implementation. If you have a custom Writable
> > that is really good at compressing your data, that will be better
> > than using an uncustomized, auto-generated Thrift API for
> > serialization out of the box. Example: say you are writing "strings"
> > and you know each string is at most 3 characters. A "smart" Writable
> > serializer with custom implementations optimized for your data will
> > beat a Thrift serialization approach. But I think in general, the
> > advantage of Thrift/Avro is that it's easier to get really good
> > compression natively out-of-the-box, because many different data
> > types are strongly supported by the way they apply their schemas
> > (for example, a Thrift struct can contain a "boolean", two
> > "strings", and an "int", and these types will all be optimized for
> > you by Thrift). Whereas with Writables, you cannot as easily create
> > sophisticated types with optimization of nested properties.
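> >
> > To make (4) concrete, here is a rough, untested sketch (all names
> > are hypothetical) of a fixed-width Writable for strings of at most
> > 3 ASCII characters -- exactly three bytes on the wire, no length
> > prefix:
> >
> >     import java.io.DataInput;
> >     import java.io.DataOutput;
> >     import java.io.IOException;
> >     import java.nio.charset.StandardCharsets;
> >     import org.apache.hadoop.io.Writable;
> >
> >     public class ShortCodeWritable implements Writable {
> >
> >       // Fixed width: 3 bytes, unlike Text's length-prefixed UTF-8.
> >       private final byte[] buf = new byte[3];
> >
> >       public void set(String s) {
> >         if (s.length() > 3) {
> >           throw new IllegalArgumentException("max 3 characters");
> >         }
> >         for (int i = 0; i < 3; i++) {
> >           // Assumes ASCII input; pad short strings with spaces.
> >           buf[i] = i < s.length() ? (byte) s.charAt(i) : (byte) ' ';
> >         }
> >       }
> >
> >       public String get() {
> >         return new String(buf, StandardCharsets.US_ASCII).trim();
> >       }
> >
> >       @Override
> >       public void write(DataOutput out) throws IOException {
> >         out.write(buf); // always exactly 3 bytes
> >       }
> >
> >       @Override
> >       public void readFields(DataInput in) throws IOException {
> >         in.readFully(buf);
> >       }
> >     }
> >
> > And re: (2), registering extra serializers for I/O is just a config
> > change; for instance, something like this keeps the default Writable
> > support while adding Avro reflect-based serialization:
> >
> >     // In job setup (sketch); the exact list you want will vary.
> >     Configuration conf = new Configuration();
> >     conf.setStrings("io.serializations",
> >         "org.apache.hadoop.io.serializer.WritableSerialization",
> >         "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization");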
> >
> > On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe
> > <radhe.krishna.ra...@live.com> wrote:
> >> Hello All,
> >>
> >> AFAIK Hadoop serialization comes into the picture in 2 areas:
> >>
> >> - putting data on the wire, i.e. for interprocess communication
> >>   between nodes using RPC
> >> - putting data on disk, i.e. using Map Reduce for persistent
> >>   storage, say HDFS.
> >>
> >> I have a couple of questions regarding the Serialization mechanisms
> >> used in Hadoop:
> >>
> >> - Does Hadoop provide a pluggable feature for Serialization for
> >>   both the above cases?
> >> - Is Writable the default Serialization mechanism for both the
> >>   above cases?
> >> - Were there any changes w.r.t. Serialization from Hadoop 1.x to
> >>   Hadoop 2.x?
> >> - Will there be a significant performance gain if the default
> >>   Serialization, i.e. Writables, is replaced with Avro, Protocol
> >>   Buffers or Thrift in Map Reduce programming?
> >>
> >> Thanks,
> >> -RR
> >
> > --
> > Jay Vyas
> > http://jayunit100.blogspot.com
>
> --
> Harsh J

--
Jay Vyas
http://jayunit100.blogspot.com