The transport stuff is pretty thin. I doubt you would actually save
any time removing any of it. (Particularly as there are already byte[]-backed
transports you can use.)

With the addition of Schemes in 0.8/0.9, at least in Java, you could
presumably modify the code generator to add support for your own hyper-fast
protocols that don't care about things like endianness. However, managing
endianness is a pretty core feature for most Thrift users, and
I doubt it would be something we'd end up folding into trunk.
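To illustrate why endianness handling costs something: Thrift's binary
protocols write multi-byte values in network (big-endian) order, which on a
little-endian x86 machine means a byte swap on every read and write. A
minimal JDK-only sketch (the class and method names here are mine, not
Thrift's):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianDemo {
    // Write a long in network (big-endian) order, as Thrift's binary protocols do.
    public static byte[] bigEndian(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putLong(v).array();
    }

    // Write a long in the machine's native order; on x86 this skips the byte swap.
    public static byte[] nativeOrder(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.nativeOrder()).putLong(v).array();
    }

    public static void main(String[] args) {
        long v = 0x0102030405060708L;
        System.out.println(Arrays.toString(bigEndian(v)));    // always [1..8]
        System.out.println(Arrays.toString(nativeOrder(v)));  // machine-dependent
    }
}
```

A protocol that always used native order would avoid the swap, at the cost
of making the serialized bytes architecture-dependent.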

Worth noting is that if you're using your Thrift structs in a Hadoop
context, particularly if you're on Cascading, a lot of work has already
been done here. There are tools for using them as keys and values. And you
should check out the Tuple Protocol, which is something I created that
strips out a lot of the backwards compatibility features for the purpose of
saving space in serialized structs. It is also faster as a side effect,
though I don't know if it's 20% faster.
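For a sense of the space involved: TBinaryProtocol precedes every field with
a one-byte type tag and a two-byte field id, which a positional format can
drop entirely because the reader already knows the layout. A JDK-only sketch
of the difference for a single i64 field (class and method names are mine):

```java
import java.nio.ByteBuffer;

public class FieldOverhead {
    // TBinaryProtocol-style i64 field: 1-byte type tag + 2-byte field id + 8-byte value.
    public static byte[] taggedI64(short fieldId, long value) {
        ByteBuffer b = ByteBuffer.allocate(11);
        b.put((byte) 10);     // TType.I64
        b.putShort(fieldId);  // field id, there for forward/backward compatibility
        b.putLong(value);
        return b.array();
    }

    // Positional i64 field: just the value; the reader must know the schema.
    public static byte[] positionalI64(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }
}
```

Eleven bytes versus eight per field adds up quickly on structs with many
small fields, which is the overhead the Tuple Protocol trades away.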

Also, if you're in the Hadoop context, keep in mind that Thrift
deserialization is likely to be less of a performance bottleneck than
reading and writing to disk and network.
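For reference, a hand-written serializer of the kind Anand describes below
(flat, native byte order, no field ids, length-prefixed lists) might look
like this for the double-list field of his example struct; the class and
method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MeasureSetWriter {
    // Serialize a double[] field into dst; returns the number of bytes written.
    // Native byte order, no field ids, a 4-byte element count as the only framing.
    public static int writeSimple(double[] simple, ByteBuffer dst) {
        dst.order(ByteOrder.nativeOrder());
        int start = dst.position();
        dst.putInt(simple.length);
        for (double d : simple) {
            dst.putDouble(d);
        }
        return dst.position() - start;
    }
}
```

This is roughly what the generated code would collapse to once the protocol
and transport layers, field headers, and byte swapping are stripped away.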

-Bryan

On Mon, Apr 9, 2012 at 12:38 PM, Anand Srivastava <
[email protected]> wrote:

> Hi Bryan,
>        The Protocol and Transport classes provide nice abstractions but
> are not really needed by us. A function of the kind "int toByteArray(byte[]
> dst, int dstLen)" would have worked for me: one that does not take care of
> endianness, nor does it insert field ids (for backward and forward
> compatibility), if it performed 20% faster.
> We don't mind enhancing the code generator but are worried about having to
> keep porting it on every Thrift upgrade.
> I am open to trying a different implementation altogether even if it has
> less features but performs better. (auto-generation of robust
> serialization-deserialization code for nested structs being a mandatory
> feature).
> We are going to use these as key-value objects in Hadoop map-reduce and
> would write the comparators when required.
>
> Regards,
> Anand
>
> On 09-Apr-2012, at 8:19 PM, Bryan Duxbury wrote:
>
> > What kind of "other abstractions" would you like to turn off?
> >
> > On Sun, Apr 8, 2012 at 10:50 PM, Anand Srivastava <
> > [email protected]> wrote:
> >
> >> Hi,
> >> We have been using Thrift objects when we want to serialize to disk and
> >> back (in C++ and Java). While profiling, we have noticed that hand-written
> >> serializers for our specific objects can perform much better (~2x) than
> >> Thrift. While we understand it is unreasonable to expect Thrift to match
> >> their performance given the generic use cases it solves, we have been
> >> wondering if we can somehow keep using the auto-generated
> >> serialization-deserialization code but not pay for the other abstractions
> >> provided.
> >> So, we don't care about the language/architecture independence provided or
> >> the backward compatibility. Any suggestions on possible performance
> >> improvements as a tradeoff for some features would be useful.
> >>
> >> We have tried TBinaryProtocol and TCompactProtocol along with
> >> TMemoryBuffer.
> >>
> >> The objects we are interested in are similar to the one below:
> >>
> >> Java:
> >> public class MeasureSet {
> >>     public double[] simple;
> >>     public ArrayList<ArrayList<Long>> complex;
> >>     public ArrayList<ByteBuffer> others;
> >> }
> >>
> >> IDL:
> >> struct ThriftMeasureSet {
> >>   1: list<double> simple,
> >>   2: list<list<i64>> complex,
> >>   3: list<binary> others
> >> }
> >>
> >> We have tried Protocol Buffers as well and find their performance to be
> >> similar to Thrift's. Any pointers are welcome.
> >>
> >> Thanks and Regards,
> >> Anand
> >>
>
>
