[jira] Commented: (THRIFT-110) A more compact format

Chad Walters (JIRA) Wed, 20 Aug 2008 04:43:10 -0700

    [ 
https://issues.apache.org/jira/browse/THRIFT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623961#action_12623961
 ]


Chad Walters commented on THRIFT-110:
-------------------------------------

There seem to be 3 general areas that we are touching on here:
A. Variable-length encoding for integer types
B. Indexed string lookup
C. Non-homogeneous collections

It seems like A is not that controversial, so here is a bit of a deep dive WRT 
implementation:

So far we all seem to be in agreement on at least the following:
1. Don't change the BinaryProtocol at all
2. Extend the DenseProtocol somewhat to support various encoding schemes

Let me point out a few important implementation concerns related to the above:

1. In Thrift, all protocol implementations share a common interface that is 
used in the bindings. So while we don't want to change the actual wire format 
of the BinaryProtocol, some code changes to the BinaryProtocol (and any other 
protocol implementations) will be necessary. In particular, when we extend the 
IDL to include new types or type modifiers, we will need to make some code 
changes, even if the net result for the BinaryProtocol is to treat the new 
types like the old types or ignore the type modifiers.

2. WRT DenseProtocol implementation, nobody seems to have taken note of my 
specific suggestion on making the current variable-length encoding the default 
and then having modifiers to allow IDL writers to state that something should 
use "fixed" or "zipper" encoding. I'd like to push on this a little more. This 
mechanism would provide full backwards compatibility for any current users of 
the DenseProtocol (David says it is used internally at Facebook).

There are two possible approaches we can take to extending the IDL and the 
protocol interface: add new types or add type modifiers.

I personally prefer the type modifier approach. Instead of adding a profusion 
of new types, we just add a couple of modifier keywords (tentatively "fixed" and 
"zipper"). We could use the top 2 bits of the type bytes for type modifiers. 
These two bits would be pulled out in the language bindings and passed as an 
additional parameter in each call to the protocol interface. For integers, the 
values would be:
0 = default (variable length encoding with non-negative preferred)
1 = zipper (variable length encoding that also works well for small negative 
values)
2 = fixed
3 = reserved for future use
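
As a rough illustration of the bit layout described above, here is a minimal Python sketch. The helper names (pack_type_byte, split_type_byte) and the constant names are hypothetical, and the sketch assumes the modifier sits in bits 7-6 with the base type ID in the low six bits, which only works while type IDs stay below 64.

```python
# Hypothetical layout: modifier in bits 7-6, base type ID in bits 5-0.
MOD_DEFAULT, MOD_ZIPPER, MOD_FIXED, MOD_RESERVED = 0, 1, 2, 3

MODIFIER_SHIFT = 6
TYPE_MASK = 0x3F  # low six bits hold the base type ID


def pack_type_byte(base_type, modifier):
    """Combine a base type ID (0-63) and a 2-bit modifier into one byte."""
    if not 0 <= base_type <= TYPE_MASK:
        raise ValueError("base type must fit in six bits")
    if not 0 <= modifier <= 3:
        raise ValueError("modifier must fit in two bits")
    return (modifier << MODIFIER_SHIFT) | base_type


def split_type_byte(type_byte):
    """Return (base_type, modifier) extracted from a packed type byte."""
    return type_byte & TYPE_MASK, (type_byte >> MODIFIER_SHIFT) & 0x03
```

The bindings would call something like split_type_byte once per field and hand the modifier to the protocol as the extra parameter.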

The BinaryProtocol would just ignore the type modifier parameter and work as 
is. The DenseProtocol would respect it and use the appropriate encoding. The 
performance cost of pulling out the top two bits and masking the type byte 
should be negligible.
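
To make the difference between the two protocols concrete, here is a hedged Python sketch of how each might handle the modifier parameter. Everything in it is an assumption for illustration: the function names are invented, "zipper" is taken to mean a zigzag-style mapping of signed to unsigned values, and the varint emits the low-order 7-bit group first, which may not match the DenseProtocol's actual wire layout.

```python
MOD_DEFAULT, MOD_ZIPPER, MOD_FIXED = 0, 1, 2


def encode_varint(value):
    """Base-128 varint of a non-negative integer, low-order group first."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # continuation bit: more groups follow
        else:
            out.append(group)
            return bytes(out)


def zigzag(value, bits):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return ((value << 1) ^ (value >> (bits - 1))) & ((1 << bits) - 1)


def dense_write_int(value, bits, modifier):
    """Encode a signed integer the way a modifier-aware protocol might."""
    if modifier == MOD_FIXED:
        return value.to_bytes(bits // 8, "big", signed=True)
    if modifier == MOD_ZIPPER:
        return encode_varint(zigzag(value, bits))
    if modifier == MOD_DEFAULT:
        # Negative values are reinterpreted as unsigned, so they take the longest form.
        return encode_varint(value & ((1 << bits) - 1))
    raise ValueError("reserved modifier")


def binary_write_int(value, bits, modifier):
    """A fixed-width protocol accepts the modifier but ignores it."""
    return value.to_bytes(bits // 8, "big", signed=True)
```

Under these assumptions, an i32 of -1 takes one byte with the zipper modifier and five bytes with the default, which is the reason for offering both.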

This type modifier mechanism could also be used to pass through the "extern" 
modifier on strings (and binary?):
0 = standard handling
1 = string is stored via index into externed buffer

Again, the BinaryProtocol could ignore this and the DenseProtocol could make 
use of it.
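
One way the indexed lookup could work is sketched below in Python. The class and function names are hypothetical, and the sketch leaves open how the externed buffer itself would be transmitted; it only shows repeated strings collapsing to an index.

```python
class ExternStringWriter:
    """Hypothetical helper: replaces repeated strings with indexes into a shared buffer."""

    def __init__(self):
        self.table = []    # externed buffer, in first-seen order
        self._index = {}   # string -> position in self.table

    def write_string(self, value, extern):
        """Return ("ref", index) for externed strings, ("inline", value) otherwise."""
        if not extern:
            return ("inline", value)   # modifier 0: standard handling
        idx = self._index.get(value)
        if idx is None:
            idx = len(self.table)
            self.table.append(value)
            self._index[value] = idx
        return ("ref", idx)            # modifier 1: store only the index


def read_string(token, table):
    """Resolve a token produced by write_string back to the original string."""
    kind, payload = token
    return table[payload] if kind == "ref" else payload
```

A protocol that ignores the modifier would simply always take the inline path.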

In practice the IDL would now look something like this:

struct Foo {
# use fixed encoding for hash values
 1: fixed i64 md5high,
 2: fixed i64 md5low,
# default assumes small non-negative values predominate
 3: i64 count,
# use zipper encoding when small values (negative and positive) predominate
 4: zipper i32 adjust,
# indicate that the string value should be stored in an externed buffer
 5: extern string typeName
}






> A more compact format 
> ----------------------
>
>                 Key: THRIFT-110
>                 URL: https://issues.apache.org/jira/browse/THRIFT-110
>             Project: Thrift
>          Issue Type: Improvement
>            Reporter: Noble Paul
>
> Thrift is not as compact in writing out data as, say, protobuf. It does 
> not have the concept of variable length integers and various other 
> optimizations possible. In Solr we use a lot of such optimizations to make a 
> very compact payload. Thrift has a lot in common with that format.
> It is all done in a single class
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup
> The other optimizations include writing type/value in the same byte, very fast 
> writes of Strings, externalizable strings, etc. 
> We could use a thrift format for non-java clients and I would like to see it 
> as compact as the current java version.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
