This is a very ambitious proposal! I will comment on specifics inline, but let me start by saying that we should probably create a lot of separate JIRA tickets for dealing with the ideas raised in this issue. As one monolithic blob, it will be too difficult to implement and review.

On Sep 9, 2008, at 5:40 AM, Johan Stuyts wrote:

Hi,

I thought the move from Facebook to Apache packages might be an opportunity to redesign the Java API. Redesigning the API at this stage means that there are no backward compatibility issues: you either use the old or the new packages. A switch can be added to the compiler to choose between code that uses the old or the new packages.
I appreciate the attention to backward compatibility issues, but I don't think we should be so quick to fork the Java libraries at this stage in the game. While it may not be possible in every regards to every one of your suggestions below, I think in general it would be best if we could apply the improvements to the existing libraries rather than create new ones. This specter of backwards compatibility has been brought up a few times now in regards to big refactors (like changing packages, etc), and I really think while we're in the Incubator, pre-release, we should just refactor what we want to refactor. We're never going to get such an open chance again. That's not to say we shouldn't take steps to ease the transition for people with existing code, but let's find a way that doesn't involve creating a new library and deprecating the old one.


Why can a redesign be beneficial?
- A different set of functions for TProtocol so protocol implementations have more flexibility (For example see https:// issues.apache.org/jira/browse/THRIFT-110 (Changes to the IDL would still be required to be able to implement a compact format)). - Support for changes to the IDL needed to support more formats (e.g. the compact format) if the IDL is changed.
It's true that it would be nice if the TProtocol interface was richer, because it would definitely make implementing the compact protocol easier. However, that said, I'm not convinced its necessary. We've established that it can be done with a stateful TProtocol. (We also don't need IDL changes to implement the compact protocol, with the possible exception of extern strings.)

- Properly supporting optional members of structures.
This is done in the java:beans generator already. I do wonder why anyone would use the non-beans generator at all, though, so you might make the argument that the two paths should be unified.

- Removal of the name when writing the beginning of the structure as nobody seems to need this (https://issues.apache.org/jira/browse/ THRIFT-8). There is no need to hang on to obsolete constructs and confuse new users.
This is part of the protocol interface, correct? Are you suggesting that we remove struct names from all of thrift, or just from the Java API? If it's unused, I'm pro-removal everywhere. However, we do have to decide if that is a limitation we want to have.

- Replace base classes with interfaces so implementations that want to do things differently do not have to extend the base class awkwardly. For example: TProtocol and TServer.
+1

- Refactoring of strange constructs in the current API. For example: the 'getTransport(TTransport)' method of TTransportFactory is used as a transport transformer in addition to being used as a factory. When used as a factory method the parameter makes no sense.
+1

- Remove the Client and Processor classes from generated code of services so server and client implementations have more freedom concerning how (service and) function selection information is communicated and processed, and how I/O is handled. See below.
Are you sure we should stop generating the code altogether? For most default cases, I doubt people are going to want to reimplement these pieces. I'd be for making the Processor interface clearer and easier to implement outside of code generation, though.

- Drop transports. See below.
I sort of like the transport abstraction. Not every use of transports is hidden behind a server, after all - Rapleaf does a lot of serializing structures to byte arrays and such. For the case of a transport and a server being intimately intertwined, I would say that there's nothing stopping you from not taking a Transport in the constructor and doing whatever you want internally. The only point at which you need to use a Transport is as soon as you want to rely on the generated code (de)serialization, and at that point you can use the memory based transports. I did something analogous in the TNonblockingServer.

- Add an interface to the generated code and add support classes for doing asynchronous calls (i.e. support for sequence IDs). See below. - Drop the many constructors of a number of classes and replace them with a single constructor taking an options object.
I see where you're going with this, but doesn't this just mean that you now have to manually code the validation of order and presence of certain parameters in certain situations? Certainly the existing scenario of tons of constructors is overly verbose, but at least it avoids the bugs that would show up if we tried to do it ourself.

- Drop the 'T' prefix of types as this is not customary in Java.
I'm mostly pro on this one, though there are a few situations that would yield name clashes if we did this to every class (TException -> Exception, for one).


Removal of Client and Processor
===============================
It is impossible for the client and server implementations to handle services in a generic way currently. The only types available are TProcessorFactory and TProcessor. These types do not provide enough information. For example: for some server implementations it is interesting to know whether or not a function has the 'async' keyword. If a function does have it, the communication channel can be released immediatly instead of being held onto while the asynchronous function is running.

In addition, the assumption that the protocol can be assigned to the classes Client and Processor once, makes it more difficult to implement multiplexing servers (which also need to send the service name in addition to the function name), clients and servers using non-blocking I/O, connection pooling clients, asynchronous calls, etc. It would be much better if client and server implementations have more control over how they want to handle things before passing control to the function implementations.

By only supplying the minimal information about services and functions to client and server implementations, and by passing the protocol as late as possible to function implementations, lots of flexibility is gained. Here is a rough draft of how the API would look like (from the perspective of the server. Information and methods needed for client implementations are missing):
class Service {
  Function get(String);
}
class Function {
  boolean isAsynchronous();

  // Throws an exception if function is asynchronous.
  //
  // Only reads the structure containing the parameters
  // using the protocol. Messages are handled by the
  // server.
  //
  // Returns the wrapper around the result.
  //
  // Server can decide when to write the result without
  // having to copy it to a byte array.
  Base invoke(Protocol);

  // Discards result if function is synchronous.
  //
  // Only reads the structure containing the parameters
  // using the protocol. Messages are handled by the
  // server.
  void invoke(Protocol);
}

Another example that could be built is a client that requests the mapping of (services and) functions to IDs from the server, so invocation of a particular function does not require the sending of the (service and) function names. Instead a few bytes are all that is needed to select a function. This is very difficult to do now because the Client and Processor classes have control over the information of the message being sent.

For the multiplexing server that I want to build having more freedom could mean replacing this request header:
i32:    version and type
string: service name
i32:    sequence ID (unused)
i32:    version and type
string: function name
i32:    sequence ID

With this request header (if I implement the function to ID mapping). This saves (for a 15-character ASCII service name, a 10- character ASCII function name and less than 16384 functions) 42-43 bytes per function invocation:
byte:   version and type
vint:   function ID (1 or 2 bytes)
i32:    sequence ID

If I would use one bit of the version and type byte as a flag to indicate sequence IDs are not needed, another 4 bytes could be saved, i.e. only 2-3 bytes would be needed to select a function.

Note that protocols must understand more header types than 'TMessage' for this to work. Maybe the clients and servers must be given the freedom to write the header instead of having a single message type. Not all clients and servers can have their own header format because that would defeat interoperability, instead a number of header formats have to be agreed upon. For example:
- Current header format for single-service clients and servers.
- New header format for multiplexed clients and servers with support for:
  - a function to ID map for more efficient function selection;
  - indicating a sequence ID is not needed.

Removal of the Client and Processor classes will also remove a lot of duplicate code for handling calls from the generated classes. In my opinion this duplicate code can easily be transformed into generic code in client and server implementations.

Drop transports
===============
All servers currently require a transport, but the internal workings of a server are closely tied to a specific transport. Why have servers work with wrappers around sockets instead of working with sockets directly? Possibilities for server implementation will now be limited by the wrappers instead of allowing the full flexibility of the socket API.

In my opinion a transport is internal to a server. If a server can be implemented in a better way, the thin transport wrapper around sockets should not be a limiting factor. It would be very cumbersome if an API change of a 'TServerTransport' implementation is needed to be able to implement a server differently.

The only thing that needs to be handled is 'TFramedTransport' because this is a decorator. In my opinion it may be better to implement it as a protocol decorator which frames messages.

I don't know how this works out for clients but I think it would be very similar.

Asynchronous interface
======================
(Does anyone have a better name than 'asynchronous functions' so the 'async' keyword and sequence IDs cannot be confused?)

The addition of an asynchronous interface (to support sequence IDs) besides the synchronous one makes it easier to implement asynchronous calls (using sequence IDs). Asynchronous calls need to be handled differently by a client anyway, so it is best to make this distinction clear. For some languages only asynchronous interfaces might be generated. Here is a rough draft of how this would look if polling is used to retrieve the response:
interface Calculator.IfaceAsync {
  AsynchronousCall ping();
  IntAsynchronousCall add(int, int);
  IntAsynchronousCall calculate(int, Work);
  // zip() not added because it does not return a result
}
interface AsynchronousCall {
  boolean hasResponseArrived();
}
interface IntAsynchronousCall extends AsynchronousCall {
  int getResult();
}
interface StructAsynchronousCall<R> extends AsynchronousCall {
  R getResult();
}

An alternative is to use events instead of polling:
interface Calculator.IfaceAsync {
  void ping(VoidAsynchronousCompletion);
  void add(int, int, IntAsynchronousCompletion);
  void calculate(int, Work, IntAsynchronousCompletion);
  // zip() not added because it does not return a result
}
interface VoidAsynchronousCompletion {
  void handleResponse();
}
interface IntAsynchronousCompletion {
  void handleResponse(int);
}
interface StructAsynchronousCompletion<R> {
  void handleResponse(R);
}


What do you think? Would a redesign be useful and worth it?

--
Kind regards,

Johan Stuyts

Reply via email to