When I started using protocol buffers, I ran into the problem of sending 
unstructured data inside a protocol buffers message: for example, letting 
the user send arbitrary custom data without having to define extensions, 
recompile messages, etc. One option is to send JSON inside a string field, 
or binary JSON/MessagePack inside a bytes field. Both are good approaches. 
However, I thought about using protocol buffers encoding techniques to 
store arbitrary data. You can define JSON as a set of protocol buffers 
messages, but it would not be efficient, as the many submessage fields take 
extra space. So I have changed the protocol buffers encoding rules a little 
to allow encoding arbitrary data efficiently, using basic encoding concepts 
that protocol buffers already uses, such as tags, varints, strings, etc. 
Here is an example of how a sample JSON document is encoded:

{"str":"hello", "val": 1, "array" : [true, false], "nested": { "value" : 
true }}

TAG(PSON, OBJECT)
VARINT(OBJECT_SIZE)
VARINT(3)
"str"
TAG(PSON, STRING)
VARINT(5)
"hello"
VARINT(3)
"val"
TAG(PSON, ONE)
VARINT(5)
"array"
TAG(PSON, ARRAY)
VARINT(2)
TAG(PSON, TRUE)
TAG(PSON, FALSE)
VARINT(6)
"nested"
TAG(PSON, OBJECT)
VARINT(7)
VARINT(5)
"value"
TAG(PSON, TRUE)
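
As a rough illustration of the layout above, here is a minimal Python 
sketch of an encoder. It is not the Protoson implementation. The wire type 
value (6), the numeric type codes, and the function names are all 
hypothetical, and it assumes the size written after an OBJECT or ARRAY tag 
is the byte length of the body, which is how I read VARINT(7) for the 
nested object above.

```python
# Minimal sketch of the layout in the listing above. NOT the Protoson code.
# WIRE_PSON and every type code below are hypothetical values.
WIRE_PSON = 6  # protocol buffers itself uses wire types 0-5
OBJECT, ARRAY, STRING, TRUE, FALSE, ZERO, ONE, VARINT = range(8)


def varint(n):
    """Base-128 varint: 7 payload bits per byte, least significant first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(type_code):
    """Pack the type code where protobuf keeps the field number."""
    return varint((type_code << 3) | WIRE_PSON)


def encode(value):
    # bool must be tested before int, since bool is a subclass of int
    if value is True:
        return tag(TRUE)
    if value is False:
        return tag(FALSE)
    if isinstance(value, int):
        if value == 0:
            return tag(ZERO)
        if value == 1:
            return tag(ONE)
        return tag(VARINT) + varint(value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return tag(STRING) + varint(len(raw)) + raw
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return tag(ARRAY) + varint(len(body)) + body  # assumed: byte length
    if isinstance(value, dict):
        body = b""
        for key, item in value.items():
            raw = key.encode("utf-8")
            body += varint(len(raw)) + raw + encode(item)  # key, then value
        return tag(OBJECT) + varint(len(body)) + body  # assumed: byte length
    raise TypeError("unsupported type: %r" % type(value))


doc = {"str": "hello", "val": 1, "array": [True, False],
       "nested": {"value": True}}
encoded = encode(doc)
```

With these assumptions the nested object body is 7 bytes and the array 
body is 2 bytes, matching VARINT(7) and VARINT(2) in the listing.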

For reference, a TAG in protocol buffers is the pair (wire type, field 
number). Here the field number slot carries the PSON data type.
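
To make the tag concrete: protocol buffers writes it as a single varint 
whose low 3 bits are the wire type and whose remaining bits are the field 
number. A small Python sketch of packing and unpacking it follows. The 
PSON wire type value and the OBJECT code are hypothetical numbers picked 
for illustration.

```python
# Sketch of a protocol buffers tag (key): (field_number << 3) | wire_type.
PSON_WIRE_TYPE = 6  # hypothetical; protocol buffers uses wire types 0-5
OBJECT = 7          # hypothetical type code stored in the field-number slot


def encode_varint(value):
    """Base-128 varint: high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("varints in this sketch are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def make_tag(wire_type, field_number):
    """Pack wire type and field number into one varint-encoded key."""
    return encode_varint((field_number << 3) | wire_type)


def split_tag(key):
    """Return (wire_type, field_number) from a decoded key value."""
    return key & 0x07, key >> 3


object_tag = make_tag(PSON_WIRE_TYPE, OBJECT)
```

Any type code below 16 keeps the key under 128, so the whole tag fits in 
one byte.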

This approach could easily be integrated into protocol buffers by defining 
a new wire type 'pson' (it remains traversable), which defines a set of 
custom fields inside the tag to determine the data type (object, array, 
string, bytes, varint, signed varint, float, boolean, true, false, 1, 0, 
etc.). It also applies some optimizations: encoding a boolean, zero, or one 
requires just one byte. Floating point numbers are also encoded as varints 
under some circumstances. Signed integers are encoded as simple varints and 
the sign is restored on decoding, and so on.
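
The scalar optimizations could look roughly like the Python sketch below. 
It is not the Protoson code: the type codes and the wire type value are 
hypothetical, the condition for storing a float as a varint (an exact 
integer value below 2**32) is my assumption, and so is the 8-byte 
little-endian fallback for other floats.

```python
import struct

# Sketch only. WIRE_PSON and the type codes are hypothetical values.
WIRE_PSON = 6
ZERO, ONE, VARINT, SVARINT, FLOAT = range(5)


def varint(n):
    """Unsigned base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data, pos):
    """Decode a varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def tag(code):
    return varint((code << 3) | WIRE_PSON)


def encode_number(value):
    if isinstance(value, float):
        # Assumed rule: floats with an exact, small integer value are
        # stored as integers; everything else as an 8-byte double.
        if value.is_integer() and abs(value) < 2**32:
            value = int(value)
        else:
            return tag(FLOAT) + struct.pack("<d", value)
    if value == 0:
        return tag(ZERO)            # one byte, no payload
    if value == 1:
        return tag(ONE)             # one byte, no payload
    if value > 0:
        return tag(VARINT) + varint(value)
    return tag(SVARINT) + varint(-value)  # magnitude only; sign is in the tag


def decode_number(data):
    key, pos = read_varint(data, 0)
    code = key >> 3
    if code == ZERO:
        return 0
    if code == ONE:
        return 1
    if code == VARINT:
        return read_varint(data, pos)[0]
    if code == SVARINT:
        return -read_varint(data, pos)[0]  # restore the sign
    if code == FLOAT:
        return struct.unpack_from("<d", data, pos)[0]
    raise ValueError("unknown type code: %d" % code)
```

One consequence of this sketch is that a float such as 3.0 decodes as the 
integer 3, which is acceptable for JSON-like data but loses the original 
type.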

This format can also encode data without a root container such as an array 
or a JSON object, and it can store binary data as well. So the pson wire 
type could be an efficient way to store any unstructured data.

I have implemented a preliminary version of this approach on GitHub 
(https://github.com/thinger-io/Protoson). Depending on the encoded content, 
the output size is similar to or smaller than that of MessagePack.
