[
https://issues.apache.org/jira/browse/THRIFT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623961#action_12623961
]
Chad Walters commented on THRIFT-110:
-------------------------------------
There seem to be 3 general areas that we are touching on here:
A. Variable-length encoding for integer types
B. Indexed string lookup
C. Non-homogeneous collections
It seems like A is not that controversial, so here is a bit of a deep dive WRT
implementation:
So far we all seem to be in agreement on at least the following:
1. Don't change the BinaryProtocol at all
2. Extend the DenseProtocol somewhat to support various encoding schemes
Let me point out a few important implementation concerns related to the above:
1. In Thrift, all protocol implementations share a common interface that is
used in the bindings. So while we don't want to change the actual wire format
of the BinaryProtocol, some code changes to the BinaryProtocol (and any other
protocol implementations) will be necessary. In particular, when we extend the
IDL to include new types or type modifiers, we will need to make some code
changes, even if the net result for the BinaryProtocol is to treat the new
types like the old types or ignore the type modifiers.
2. WRT DenseProtocol implementation, nobody seems to have taken note of my
specific suggestion on making the current variable-length encoding the default
and then having modifiers to allow IDL writers to state that something should
use "fixed" or "zipper" encoding. I'd like to push on this a little more. This
mechanism would provide full backwards compatibility for any current users of
the DenseProtocol (David says it is used internally at Facebook).
There are two possible approaches we can take to extending the IDL and the
protocol interface: add new types or add type modifiers.
I personally prefer the type modifier approach. Instead of adding a profusion
of new types, we just add a couple of modifier keywords (tentatively "fixed" and
"zipper"). We could use the top 2 bits of the type byte for type modifiers.
These two bits would be pulled out in the language bindings and passed as an
additional parameter in each call to the protocol interface. For integers, the
values would be:
0 = default (variable length encoding with non-negative preferred)
1 = zipper (variable length encoding that also works well for small negative
values)
2 = fixed
3 = reserved for future use
The BinaryProtocol would just ignore the type modifier parameter and work as
is. The DenseProtocol would respect it and use the appropriate encoding. The
performance cost of pulling out the top two bits and masking the type byte
should be negligible.
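To make the bit layout concrete, here is a rough sketch of how a binding might
pack and split the type byte. The helper names, the 6-bit mask, and the use of
type ID 8 for i32 are assumptions for illustration only; the proposal above
fixes just the modifier values 0 through 3.

```python
# Hypothetical layout: modifier in the top 2 bits, base type in the low 6 bits.
MODIFIER_SHIFT = 6
TYPE_MASK = 0x3F

# Modifier values as listed in the proposal for integer types.
MOD_DEFAULT = 0   # variable length, non-negative preferred
MOD_ZIPPER = 1    # variable length, small negatives also cheap
MOD_FIXED = 2     # fixed width
MOD_RESERVED = 3  # reserved for future use

I32 = 8  # assumed base type ID, used here only as an example value


def pack_type_byte(base_type, modifier):
    """Combine a base type ID and a 2-bit modifier into one byte."""
    if not 0 <= base_type <= TYPE_MASK:
        raise ValueError("base type does not fit in the low 6 bits")
    if not 0 <= modifier <= 3:
        raise ValueError("modifier must fit in 2 bits")
    return (modifier << MODIFIER_SHIFT) | base_type


def split_type_byte(type_byte):
    """Return (base_type, modifier) as a binding would before calling the protocol."""
    return type_byte & TYPE_MASK, (type_byte >> MODIFIER_SHIFT) & 0x3


base_type, modifier = split_type_byte(pack_type_byte(I32, MOD_ZIPPER))
```

A protocol like the BinaryProtocol would simply drop `modifier`; one like the
DenseProtocol would branch on it to pick the encoding.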
This type modifier mechanism could also be used to pass through the "extern"
modifier on strings (and binary?):
0 = standard handling
1 = string is stored via index into externed buffer
Again, the BinaryProtocol could ignore this and the DenseProtocol could make
use of it.
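For the "extern" case, the idea of storing a string via an index into an
externed buffer could look roughly like the sketch below. The class and method
names are hypothetical, and how the buffer itself would travel on the wire is
not specified in this comment, so that part is left out.

```python
class ExternStringTable:
    """Hypothetical externed buffer: each distinct string is stored once."""

    def __init__(self):
        self._index = {}    # string -> position in self.strings
        self.strings = []   # the externed buffer itself

    def intern(self, value):
        """Return the index for value, adding it to the buffer if it is new."""
        idx = self._index.get(value)
        if idx is None:
            idx = len(self.strings)
            self._index[value] = idx
            self.strings.append(value)
        return idx

    def lookup(self, idx):
        """Resolve an index read from the wire back to its string."""
        return self.strings[idx]


table = ExternStringTable()
first = table.intern("Foo")
second = table.intern("Bar")
repeat = table.intern("Foo")  # same index as the first "Foo"
```

A field marked extern would then write only the small integer index, which
pays off when the same string (a type name, say) repeats many times.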
In practice the IDL would now look something like this:
struct Foo {
  # use fixed encoding for hash values
  1: fixed i64 md5high,
  2: fixed i64 md5low,
  # default assumes small non-negative values predominate
  3: i64 count,
  # use zipper encoding when small values (negative and positive) predominate
  4: zipper i32 adjust,
  # indicate that the string value should be stored in an external buffer
  5: extern string typeName
}
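To show why the three integer modifiers matter, here is a sketch of what each
one might produce on the wire. It assumes a base-128 varint for the default
encoding, a zigzag mapping for "zipper", and big-endian two's complement for
"fixed". None of those choices is confirmed by this comment or by the current
DenseProtocol code; they are stand-ins to compare sizes.

```python
import struct


def encode_varint(n):
    """Base-128 varint of a non-negative integer, low 7 bits first."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_default(n, bits):
    """Assumed default: varint of the two's complement bit pattern."""
    return encode_varint(n & ((1 << bits) - 1))


def encode_zipper(n, bits):
    """Assumed zipper: zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    zigzag = ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)
    return encode_varint(zigzag)


def encode_fixed(n, bits):
    """Assumed fixed: big-endian two's complement, always bits / 8 bytes."""
    return struct.pack(">i" if bits == 32 else ">q", n)


# Field 4 above (zipper i32 adjust) holding a small negative value.
adjust_default = encode_default(-3, 32)
adjust_zipper = encode_zipper(-3, 32)
# Field 1 above (fixed i64 md5high) holding a large hash-like value.
hash_fixed = encode_fixed(1 << 62, 64)
```

Under these assumptions a small negative value is expensive in the default
encoding and cheap with zipper, while a hash-like value never does better than
fixed width.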
> A more compact format
> ----------------------
>
> Key: THRIFT-110
> URL: https://issues.apache.org/jira/browse/THRIFT-110
> Project: Thrift
> Issue Type: Improvement
> Reporter: Noble Paul
>
> Thrift is not very compact in writing out data as (say protobuf) . It does
> not have the concept of variable length integers and various other
> optimizations possible . In Solr we use a lot of such optimizations to make a
> very compact payload. Thrift has a lot common with that format.
> It is all done in a single class
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup
> The other optimizations include writing type/value in same byte, very fast
> writes of Strings, externalizable strings etc
> We could use a thrift format for non-java clients and I would like to see it
> as compact as the current java version