Re: [protobuf] incompatible type changes philosophy

2012-05-10 Thread Evan Jones
On May 9, 2012, at 15:26 , Jeremy Stribling wrote:
 * There are two nodes, 1 and 2, running version A of the software.
 * They exchange messages containing protobuf P, which contains a string field 
 F.
 * We write a new version B of the software, which changes field F to an 
 integer as an optimization.
 * We upgrade node 1, but not node 2.
 * If node 1 sends a protobuf P to node 2, I want node 2 to be able to access 
 field F as a string, even though the wire format sent by node 1 was an 
 integer.


I think you can achieve your goals by building a layer on top of the existing 
protocol buffer parsing, possibly in combination with some custom options, a 
protoc plugin, and maybe a small tweak to the existing C++ code generator. You 
do the breaking change by effectively renaming the field, then using a protoc 
plugin to make it invisible to the application. To make this concrete, your 
Version A looks like:

message P {
  optional string F = 1;
}


Then Version B looks like the following:

message P {
  optional string old_F = 1 [(custom_upgrade_option) = some_upgrade_code];
  optional int32 F = 2;
}


With this structure, Version B can always parse a Version A message. Senders 
will always ensure there is only one version in the message, so the only thing 
you are losing here is a field number, which isn't a huge deal. However, you 
now want to automatically convert old_F to F. This can be done without 
changing the guts of the parser by writing a protoc plugin that generates a 
member function based on the custom option:

void UpgradeToLatest() {
  if (has_old_F()) {
    set_F(some_upgrade_code(get_old_F()));
    clear_old_F();
  }
}


You then need to make sure that Version B of the software calls this everywhere 
it is needed. Maybe this argues that what is needed is a post-processing 
insertion point in ::MergePartialFromCodedStream? Then your protoc plugin could 
insert this call after a protocol buffer message is successfully parsed, so the 
application would only ever have to deal with the integer version.
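
As a rough illustration of the "upgrade right after parsing" idea, here is a minimal C++17 sketch. It does not use the real protobuf library: the struct P, its toy "s:"/"i:" wire format, and the ParseAndUpgrade helper are all hypothetical stand-ins for what a protoc plugin might generate, and some_upgrade_code is assumed to be a simple decimal string-to-int conversion.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hand-written stand-in for the class protoc would generate for Version B of P.
struct P {
  std::optional<std::string> old_F;  // field 1: the Version A string form
  std::optional<int> F;              // field 2: the Version B integer form

  // Toy wire format, for illustration only:
  // "s:<text>" carries old_F and "i:<digits>" carries F.
  bool ParseFromString(const std::string& bytes) {
    old_F.reset();
    F.reset();
    if (bytes.size() < 3 || bytes[1] != ':') return false;
    const std::string payload = bytes.substr(2);
    if (bytes[0] == 's') {
      old_F = payload;
      return true;
    }
    if (bytes[0] == 'i') {
      try {
        F = std::stoi(payload);
      } catch (const std::logic_error&) {
        return false;  // not a number, or out of range for int
      }
      return true;
    }
    return false;
  }

  // Hypothetical conversion named by (custom_upgrade_option).
  // Assumes every old string value holds decimal digits.
  static int some_upgrade_code(const std::string& s) { return std::stoi(s); }

  // Moves a Version A value into the Version B field, if one is present.
  void UpgradeToLatest() {
    if (old_F.has_value()) {
      F = some_upgrade_code(*old_F);
      old_F.reset();
    }
  }
};

// Parses, then upgrades, so callers only ever see the integer form.
template <typename Message>
bool ParseAndUpgrade(const std::string& bytes, Message* msg) {
  assert(msg != nullptr);
  if (!msg->ParseFromString(bytes)) return false;
  msg->UpgradeToLatest();
  return true;
}
```

With a wrapper like this, application code would call ParseAndUpgrade instead of the plain parse function; a post-processing insertion point in the generated parser would remove even that requirement.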


In the other direction, I don't understand how the downgrading can possibly be 
done at the receiver, since it doesn't know how to do the downgrade (unless you 
are thinking about mobile code?). So in your example, Node 1 must create a 
Version A protocol buffer message when sending to Node 2. This means you need 
*some* sort of handshaking between Node 1 and Node 2, to indicate supported 
versions.

This is the reason I proposed adding some other member function that takes a 
target_version, so the sender knows what to emit. If sending the same message 
to multiple recipients, you'll need to send the lowest version in the group. 
Based on the above, your plugin could emit:

void DowngradeToVersion(int target_version) {
  if (target_version < 0xB && has_F()) {
    set_old_F(some_downgrade_code(get_F()));
    clear_F();
  }
}


There are many other ways you could do this, but it seems to me that this 
proposal is a way to do it without complicating the base protocol buffers 
library with application-specific details.
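
To make the sender-side flow concrete, the following C++17 sketch picks the lowest version among a group of recipients and downgrades the message once before sending. The struct P is a hand-written stand-in rather than real generated code, the version constants 0xA and 0xB are assumed encodings of "Version A" and "Version B", and LowestVersion and PrepareForRecipients are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

constexpr int kVersionA = 0xA;  // assumed numeric tag for Version A
constexpr int kVersionB = 0xB;  // assumed numeric tag for Version B

// Hand-written stand-in for the Version B generated class.
struct P {
  std::optional<std::string> old_F;  // field 1: string form
  std::optional<int> F;              // field 2: integer form

  // Hypothetical inverse of the upgrade conversion.
  static std::string some_downgrade_code(int v) { return std::to_string(v); }

  // Rewrites the message into the old form for pre-B receivers.
  void DowngradeToVersion(int target_version) {
    if (target_version < kVersionB && F.has_value()) {
      old_F = some_downgrade_code(*F);
      F.reset();
    }
  }
};

// The oldest version any recipient in the group understands.
int LowestVersion(const std::vector<int>& recipient_versions) {
  assert(!recipient_versions.empty());
  return *std::min_element(recipient_versions.begin(),
                           recipient_versions.end());
}

// Returns a copy of msg that every listed recipient can read.
P PrepareForRecipients(P msg, const std::vector<int>& recipient_versions) {
  msg.DowngradeToVersion(LowestVersion(recipient_versions));
  return msg;
}
```

The recipient versions would have to come from whatever handshake the application performs; this sketch simply takes them as input.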

Evan

--
http://evanjones.ca/

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



Re: [protobuf] incompatible type changes philosophy

2012-05-10 Thread Jeremy Stribling



On 05/10/2012 07:52 AM, Evan Jones wrote:

On May 9, 2012, at 15:26 , Jeremy Stribling wrote:

* There are two nodes, 1 and 2, running version A of the software.
* They exchange messages containing protobuf P, which contains a string field F.
* We write a new version B of the software, which changes field F to an integer 
as an optimization.
* We upgrade node 1, but not node 2.
* If node 1 sends a protobuf P to node 2, I want node 2 to be able to access 
field F as a string, even though the wire format sent by node 1 was an integer.


I think you can achieve your goals by building a layer on top of the existing protocol 
buffer parsing, possibly in combination with some custom options, a protoc plugin, and 
maybe a small tweak to the existing C++ code generator. You do the breaking change by 
effectively renaming the field, then using a protoc plugin to make it 
invisible to the application. To make this concrete, your Version A looks like:

message P {
optional string F = 1;
}


Then Version B looks like the following:

message P {
optional string old_F = 1 [(custom_upgrade_option) = 
some_upgrade_code];
optional int32 F = 2;
}


With this structure, Version B can always parse a Version A message. Senders will always 
ensure there is only one version in the message, so the only thing you are 
losing here is a field number, which isn't a huge deal. However, you now 
want to automatically convert old_F to F. This can be done without changing the guts of 
the parser by writing a protoc plugin that generates a member function based on the 
custom option:

void UpgradeToLatest() {
 if (has_old_F()) {
 set_F(some_upgrade_code(get_old_F()));
 clear_old_F();
 }
}


You then need to make sure that Version B of the software calls this everywhere it is 
needed. Maybe this argues that what is needed is a post-processing insertion 
point in ::MergePartialFromCodedStream? Then your protoc plugin could insert this call 
after a protocol buffer message is successfully parsed, so the application would only 
ever have to deal with the integer version.


Yep, I think something like that could work.  Thanks, I'll have to 
explore how best to add a post-processing insertion point there, if we 
decide to go that route.





In the other direction, I don't understand how the downgrading can possibly be 
done at the receiver, since it doesn't know how to do the downgrade (unless you 
are thinking about mobile code?). So in your example, Node 1 must create a 
Version A protocol buffer message when sending to Node 2. This means you need 
*some* sort of handshaking between Node 1 and Node 2, to indicate supported 
versions.

This is the reason I proposed adding some other member function that takes a 
target_version, so the sender knows what to emit. If sending the same message 
to multiple recipients, you'll need to send the lowest version in the group. Based on the 
above, your plugin could emit:

void DowngradeToVersion(int target_version) {
 if (target_version < 0xB && has_F()) {
 set_old_F(some_downgrade_code(get_F()));
 clear_F();
 }
}


There are many other ways you could do this, but it seems to me that this 
proposal is a way to do it without complicating the base protocol buffers 
library with application-specific details.


Downgrading at the sender is not an option, because the sender might 
be writing something to persistent storage that can be read by any 
version of the program -- there might be no direct connection over which 
to relay versions.  It is possible to do the downgrading at the receiver 
by having two separate processes, likely connected over a local socket 
-- one that holds the main logic of your program, and one which is 
responsible only for translation.  Then, as part of your upgrade, you 
can first upgrade the translation program separately on all nodes, so 
they know how to downgrade from newer versions of the data.  This 
upgrade would be easy, and completely non-disruptive to the main logic 
process.  After all translation programs in the system have been 
upgraded, you can start the (possibly long) process of upgrading the 
other processes, one by one, without worrying much about the effect they 
have on the non-upgraded nodes.  As long as there's a stable interface 
between the two processes that can withstand restarts at either end, 
this should be possible.  This is what's described in Sameer's thesis.


So the challenge I'm pondering is how to plug in calls to such a program 
from somewhere in the protobuf processing path, for only the case where 
the incoming message's version is not natively supported by the 
program.  Perhaps, as you suggest, a post-processing insertion point in 
MergePartialFromCodedStream is the right way to go.  I'll report back if 
I make any progress on this.
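
One way to picture the receiver-side arrangement described above is a version-tagged envelope plus a lookup of translators that is consulted only when the incoming version is not the receiver's native one. The C++17 sketch below is a single-process approximation: Envelope, TranslatorTable, and PayloadForNativeVersion are hypothetical names, and the std::function entries stand in for calls to the separate translation process over a local socket.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// A serialized message tagged with the version of the program that wrote it.
struct Envelope {
  int version;
  std::string payload;
};

// Rewrites a payload from one version's format into another's.
using Translator = std::function<std::string(const std::string&)>;

// Stand-in for the independently upgradable translation program.
class TranslatorTable {
 public:
  void Register(int from_version, int to_version, Translator fn) {
    assert(fn != nullptr);
    table_[std::make_pair(from_version, to_version)] = std::move(fn);
  }

  // Returns the translated payload, or nullopt if no translator is known.
  std::optional<std::string> Translate(int from_version, int to_version,
                                       const std::string& payload) const {
    const auto it = table_.find(std::make_pair(from_version, to_version));
    if (it == table_.end()) return std::nullopt;
    return it->second(payload);
  }

 private:
  std::map<std::pair<int, int>, Translator> table_;
};

// Only involves the translator when the version is not natively supported.
std::optional<std::string> PayloadForNativeVersion(
    const Envelope& env, int native_version,
    const TranslatorTable& translators) {
  if (env.version == native_version) return env.payload;
  return translators.Translate(env.version, native_version, env.payload);
}
```

Because the table is the only piece that knows about newer formats, it can be upgraded on every node first, which is the property the two-process design relies on.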


Jeremy


Re: [protobuf] incompatible type changes philosophy

2012-05-09 Thread Oliver Jowett
On Wed, May 9, 2012 at 12:42 AM, Jeremy Stribling st...@nicira.com wrote:

 I'm wondering if anyone has experience with a scenario like this, and
 if there's a more elegant way to solve it.

We do something a bit similar by advertising capabilities during a
handshake at the start of each connection. If we need an incompatible
change in a message, we retain both forms as separate fields in the
protobuf definition and add a new capability that says "I understand
the new form as well as the old form". The sender ensures the right
field is set based on the recipient's advertised capabilities and
which forms the sender understands. This only works if you can do a
handshake, though - it wouldn't be any good for persistent storage or
multicast/datagram-like situations.

(In practice, it's been rare to actually do incompatible changes -
more commonly we use the capabilities to negotiate behavioural
changes)
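
A capability handshake of this kind might look roughly like the C++17 sketch below. The capability bit kUnderstandsIntF, the Negotiate helper, and FillForPeer are hypothetical names chosen for illustration, and the struct P is a hand-written stand-in for a message that retains both forms of the field.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Capability bits each side advertises during the handshake (assumed layout).
enum Capability : std::uint32_t {
  kUnderstandsIntF = 1u << 0,  // "I understand the new form as well as the old"
};

// Stand-in for a message that keeps both forms as separate fields.
struct P {
  std::optional<std::string> old_F;  // old string form
  std::optional<int> F;              // new integer form
};

// A capability is usable on a connection only if both peers advertised it.
constexpr std::uint32_t Negotiate(std::uint32_t mine, std::uint32_t theirs) {
  return mine & theirs;
}

// Sets exactly one of the two fields, based on the negotiated capabilities.
void FillForPeer(int value, std::uint32_t negotiated, P* msg) {
  assert(msg != nullptr);
  msg->old_F.reset();
  msg->F.reset();
  if (negotiated & kUnderstandsIntF) {
    msg->F = value;
  } else {
    msg->old_F = std::to_string(value);
  }
}
```

As noted above, this only helps when there is a connection to negotiate over; it offers nothing for persistent storage or datagram-style delivery.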

Oliver




Re: [protobuf] incompatible type changes philosophy

2012-05-09 Thread Evan Jones
On May 8, 2012, at 21:26 , Jeremy Stribling wrote:
 Thanks for the response.  As you say, this solution is painful because you 
 can't enable the optimization until the old version of the program is 
 completely deprecated.  This is somewhat simple in the case that you yourself 
 are deploying the software, but when you're shipping software to customers 
 (as we are) and have to support many old versions, it will take a very long 
 time (possibly years) before you can enable the optimization.  Also, it 
 breaks the downgrade path.  Once you enable the optimization, you can never 
 downgrade back to a version that did not know about the new field.

I think I now understand your problem. You want to add some additional stuff to 
your .proto file to indicate the incompatible change, then have the application 
code not need to know about it? Eg. you want to write the application code that 
only accesses new_my_data and never needs to check for deprecated_my_data, 
but in fact the underlying protocol buffer supports both fields, or something 
like that.

It seems to me like this starts to end up in the territory of "too high 
level" for the protocol buffer library itself, since I can't imagine this 
working without handshaking like Oliver talked about (e.g. "I understand 
everything up to version X"). My personal experience has been more like what 
Daniel describes: you keep both versions of the field, and your code has if 
statements to check for both. I believe this can be made to work, even in your 
scenario, but it does require ugly code in your application to handle it. My 
impression is that you are trying to avoid that.


Random brainstorming that may not be helpful in any way:

I'm curious about how you end up choosing to solve this, but I think you are 
going to need to use some combination of custom field options (to specify the 
change in a way that protoc can parse?), and then hacks in the C++ code 
generator to call your custom upgrade / downgrade code. I think this can work 
somewhat seamlessly in the "reading older messages" case (e.g. you just add code 
that says "if we see the old field, upgrade it to the new field"). However, 
this can't work in the "writing a newer message for an older receiver" case 
without making the Serialize* code aware of the version it should be *writing*. 
I think this is going to be pretty application specific?

My other thought: I think you might be able to get away with writing a protoc 
plugin that adds two functions to the class scope (which already exists as an 
insertion point):

static UpgradedMessage ParseAnyMessageVersion(…);
string SerializeToVersion(int target_version);

These functions can apply the appropriate upgrade/downgrading as needed. 
The catch is that you then need to call the appropriate functions to read/write 
the messages. However, I would argue that since in the serializing case you are 
going to need to know the target_version anyway, this might actually work?

Good luck, and again I'd be interested to know how you do end up solving this.

Evan

--
http://evanjones.ca/




Re: [protobuf] incompatible type changes philosophy

2012-05-09 Thread Jeremy Stribling

On 05/09/2012 04:41 AM, Evan Jones wrote:

On May 8, 2012, at 21:26 , Jeremy Stribling wrote:

Thanks for the response.  As you say, this solution is painful because you 
can't enable the optimization until the old version of the program is 
completely deprecated.  This is somewhat simple in the case that you yourself 
are deploying the software, but when you're shipping software to customers (as 
we are) and have to support many old versions, it will take a very long time 
(possibly years) before you can enable the optimization.  Also, it breaks the 
downgrade path.  Once you enable the optimization, you can never downgrade back 
to a version that did not know about the new field.

I think I now understand your problem. You want to add some additional stuff to your .proto file to 
indicate the incompatible change, then have the application code not need to know about it? Eg. you 
want to write the application code that only accesses new_my_data and never needs to 
check for deprecated_my_data, but in fact the underlying protocol buffer supports both 
fields, or something like that.


Hey Evan, thanks for the response.  That is one way to look at it.  
Ideally, the application code would only access my_data(), and it would 
magically appear as the new type in the new version of the app and the 
old type in the old version of the app.  But renaming the field for the 
new version is fine too.  The important points are twofold: 1) the data 
would only appear once on the wire and in storage, and be translated if 
necessary by the receiver to the expected format, and 2) this 
translation could work on the downgrade path as well, so that old 
applications could interpret data written by new 
applications, even if the format of the fields has changed.  Sameer 
Ajmani's ECOOP paper and thesis work discusses these types of scenarios 
(http://pmg.csail.mit.edu/~ajmani/papers/ecoop06-upgrades.pdf).



It seems to me like this starts to end up in the territory of "too high level" for the 
protocol buffer library itself, since I can't imagine this working without handshaking like 
Oliver talked about (e.g. "I understand everything up to version X"). My personal 
experience has been more like what Daniel describes: you keep both versions of the field, and your 
code has if statements to check for both. I believe this can be made to work, even in your 
scenario, but it does require ugly code in your application to handle it. My impression is that you 
are trying to avoid that.


I'm trying to avoid keeping both versions of the data in the wire format, 
since in this scenario the whole reason for the change was 
optimization.  I don't care if the new version of the protobuf has two 
separate fields; there just needs to be a way for the old version to 
still get at its old data.  Involving the application in some way is 
totally reasonable and expected; I am just hoping to find a way to add a 
translator into the deserialization code, so that it can be upgraded 
independently on old instances of the program, to be able to interpret 
the new version of the protobuf while still running the old version of 
the application code.  Here's a specific example:


* There are two nodes, 1 and 2, running version A of the software.
* They exchange messages containing protobuf P, which contains a string 
field F.
* We write a new version B of the software, which changes field F to an 
integer as an optimization.

* We upgrade node 1, but not node 2.
* If node 1 sends a protobuf P to node 2, I want node 2 to be able to 
access field F as a string, even though the wire format sent by node 1 
was an integer.





Random brainstorming that may not be helpful in any way:

I'm curious about how you end up choosing to solve this, but I think you are going to need to use some 
combination of custom field options (to specify the change in a way that protoc can parse?), and then hacks 
in the C++ code generator to call your custom upgrade / downgrade code. I think this can work somewhat 
seamlessly in the "reading older messages" case (e.g. you just add code that says "if we see 
the old field, upgrade it to the new field"). However, this can't work in the "writing a newer 
message for an older receiver" case without making the Serialize* code aware of the version it should be 
*writing*. I think this is going to be pretty application specific?


I think doing it on the deserialize side is better, because then we can put 
the burden of translation on the receiver, and the sender can merrily 
send the same serialized message to multiple receivers (tagged with its 
own version) without having to keep track of the version capabilities of 
each receiver.  This is especially important, as Oliver pointed out, 
when the data is not transferred over a live connection but through the 
persistent state.  It will definitely be app-specific, which was why I 
was thinking an insertion point might be the way to go.



My other thought: I think you might be able 

[protobuf] incompatible type changes philosophy

2012-05-08 Thread Jeremy Stribling
I'm working on a project to upgrade- and downgrade-proof a distributed
system that uses protobufs to communicate data between instances of a
C++ program.  I'm trying to cover all possible cases for data schema
changes between versions of my programs, and I was hoping to get some
insight from the community on what the best practice is for the
following tricky scenario.

To reduce serialization time and protobuf message size, the format of
a field in a message is changed between incompatible types.  For
example, a string field gets changed to an int, or perhaps a field
gets changed from one message type to another.  Because this is being
done as an optimization, it makes no sense to keep both versions of
the data around, so I think whether we change the field ID is not
relevant -- we only ever want to have one version of the field in any
particular protobuf.

Of course, this makes communicating between versions of the program
very difficult, and I think it requires there to be some kind of
translator code to transform the field from one format to the other.
Ideally, this transformation would be invisible to the rest of the
program.  One ugly thought I had was to have a version field in every
message, and then in the autogenerated C++ serialize code, maybe in
MergePartialFromCodedStream, I could insert a call to an external
translator program that would transform the input bytes into something
that could be decoded by the version of the message expected by this
instance of the program.  I don't think there's an insertion point
defined for this part of the code, so I'd have to write my own script
to do it.  The external translator program could be upgraded
independently of the main program, so older versions would know how to
interpret the fields of the newer versions.

I'm wondering if anyone has experience with a scenario like this, and
if there's a more elegant way to solve it.  If not, what do folks
think of this business of an external translator program?  Foolish
nonsense?  Worthy of a proper insertion point?

Thanks,

Jeremy




Re: [protobuf] incompatible type changes philosophy

2012-05-08 Thread Daniel Wright
On Tue, May 8, 2012 at 4:42 PM, Jeremy Stribling st...@nicira.com wrote:

 I'm working on a project to upgrade- and downgrade-proof a distributed
 system that uses protobufs to communicate data between instances of a
 C++ program.  I'm trying to cover all possible cases for data schema
 changes between versions of my programs, and I was hoping to get some
 insight from the community on what the best practice is for the
 following tricky scenario.

 To reduce serialization time and protobuf message size, the format of
 a field in a message is changed between incompatible types.  For
 example, a string field gets changed to an int, or perhaps a field
 gets changed from one message type to another.  Because this is being
 done as an optimization, it makes no sense to keep both versions of
 the data around, so I think whether we change the field ID is not
 relevant -- we only ever want to have one version of the field in any
 particular protobuf.


Even though you don't keep both versions of the data around, you should
keep both fields around, and have the code be able to read from whichever
is set during the transition.  You can rename the old one (say put
"deprecated" in the name) so that people know that it's old, but don't
actually remove it from the .proto file until no old instances of the proto
remain.  To put it more concretely, say you have

  optional string my_data = 1;

Now you come up with a way to encode it as an int64 instead.  You'd change
the .proto to:

  optional string deprecated_my_data = 1;
  optional int64 my_data = 2;

- At this point, you write the data to deprecated_my_data and not
my_data, but when you read, you check has_my_data() and
has_deprecated_my_data() and read from whichever one is present.  It might
help to write wrapper functions for reading and writing during the transition if
the field is accessed in many places.

- Once all instances of the program have been re-compiled so they all know
about the new int64 field, you can start writing to my_data and not
deprecated_my_data.

- Once all of the instances of the program have been recompiled again, you
can remove the code that reads deprecated_my_data, and delete the field.

This is kind of painful, but it's much cleaner than adding a version
number.  It also only ever writes the data to one field, so there's no
bloat during the transition.
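
The wrapper functions mentioned in the first step could be sketched as follows in C++17. This is not generated protobuf code: Msg is a hand-written stand-in for the message holding both fields, ReadMyData and WriteMyData are hypothetical names, and the string form is assumed to be a decimal rendering of the integer.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Stand-in for the message during the transition: both fields exist,
// but only one is ever set in a given message.
struct Msg {
  std::optional<std::string> deprecated_my_data;  // field 1 (old)
  std::optional<std::int64_t> my_data;            // field 2 (new)
};

// Read wrapper: accepts whichever field is present.
// Assumes the old string form holds a decimal integer.
std::optional<std::int64_t> ReadMyData(const Msg& m) {
  if (m.my_data.has_value()) return m.my_data;
  if (m.deprecated_my_data.has_value()) {
    return std::stoll(*m.deprecated_my_data);
  }
  return std::nullopt;
}

// Write wrapper: peers_know_new_field is false in the first step and flipped
// to true once every instance has been recompiled with the new field.
void WriteMyData(Msg* m, std::int64_t value, bool peers_know_new_field) {
  assert(m != nullptr);
  if (peers_know_new_field) {
    m->my_data = value;
    m->deprecated_my_data.reset();
  } else {
    m->deprecated_my_data = std::to_string(value);
    m->my_data.reset();
  }
}
```

Keeping the choice of field behind one flag means the second step is a one-line change, and the third step only deletes the fallback branch in ReadMyData.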

Daniel





Re: [protobuf] incompatible type changes philosophy

2012-05-08 Thread Jeremy Stribling



On 05/08/2012 06:04 PM, Daniel Wright wrote:
On Tue, May 8, 2012 at 4:42 PM, Jeremy Stribling st...@nicira.com wrote:


I'm working on a project to upgrade- and downgrade-proof a distributed
system that uses protobufs to communicate data between instances
of a C++ program.  I'm trying to cover all possible cases for data schema
changes between versions of my programs, and I was hoping to get some
insight from the community on what the best practice is for the
following tricky scenario.

To reduce serialization time and protobuf message size, the format of
a field in a message is changed between incompatible types.  For
example, a string field gets changed to an int, or perhaps a field
gets changed from one message type to another.  Because this is being
done as an optimization, it makes no sense to keep both versions of
the data around, so I think whether we change the field ID is not
relevant -- we only ever want to have one version of the field in any
particular protobuf.


Even though you don't keep both versions of the data around, you 
should keep both fields around, and have the code be able to read from 
whichever is set during the transition.  You can rename the old one 
(say put "deprecated" in the name) so that people know that it's old, 
but don't actually remove it from the .proto file until no old 
instances of the proto remain.  To put it more concretely, say you have


  optional string my_data = 1;

Now you come up with a way to encode it as an int64 instead.  You'd 
change the .proto to:


  optional string deprecated_my_data = 1;
  optional int64 my_data = 2;

- At this point, you write the data to deprecated_my_data and not 
my_data, but when you read, you check has_my_data() and 
has_deprecated_my_data() and read from whichever one is present.  It 
might help to write wrapper functions for reading and writing during the 
transition if the field is accessed in many places.

- Once all instances of the program have been re-compiled so they all 
know about the new int64 field, you can start writing to my_data and 
not deprecated_my_data.

- Once all of the instances of the program have been recompiled again, 
you can remove the code that reads deprecated_my_data, and delete the 
field.


This is kind of painful, but it's much cleaner than adding a version 
number.  It also only ever writes the data to one field, so there's no 
bloat during the transition.




Thanks for the response.  As you say, this solution is painful because 
you can't enable the optimization until the old version of the program 
is completely deprecated.  This is somewhat simple in the case that you 
yourself are deploying the software, but when you're shipping software 
to customers (as we are) and have to support many old versions, it will 
take a very long time (possibly years) before you can enable the 
optimization.  Also, it breaks the downgrade path.  Once you enable the 
optimization, you can never downgrade back to a version that did not 
know about the new field.

