This is a very ambitious proposal! I will comment on specifics
inline, but let me start by saying that we should probably create a
lot of separate JIRA tickets for dealing with the ideas raised in
this issue. As one monolithic blob, it will be too difficult to
implement and review.
On Sep 9, 2008, at 5:40 AM, Johan Stuyts wrote:
Hi,
I thought the move from Facebook to Apache packages might be an
opportunity to redesign the Java API. Redesigning the API at this
stage means that there are no backward compatibility issues: you
either use the old or the new packages. A switch can be added to
the compiler to choose between code that uses the old or the new
packages.
I appreciate the attention to backward compatibility issues, but I
don't think we should be so quick to fork the Java libraries at this
stage in the game. While it may not be possible in every regards to
every one of your suggestions below, I think in general it would be
best if we could apply the improvements to the existing libraries
rather than create new ones. This specter of backwards compatibility
has been brought up a few times now in regards to big refactors (like
changing packages, etc), and I really think while we're in the
Incubator, pre-release, we should just refactor what we want to
refactor. We're never going to get such an open chance again. That's
not to say we shouldn't take steps to ease the transition for people
with existing code, but let's find a way that doesn't involve
creating a new library and deprecating the old one.
Why can a redesign be beneficial?
- A different set of functions for TProtocol so protocol
implementations have more flexibility (For example see https://
issues.apache.org/jira/browse/THRIFT-110 (Changes to the IDL would
still be required to be able to implement a compact format)).
- Support for changes to the IDL needed to support more formats
(e.g. the compact format) if the IDL is changed.
It's true that it would be nice if the TProtocol interface was
richer, because it would definitely make implementing the compact
protocol easier. However, that said, I'm not convinced its necessary.
We've established that it can be done with a stateful TProtocol. (We
also don't need IDL changes to implement the compact protocol, with
the possible exception of extern strings.)
- Properly supporting optional members of structures.
This is done in the java:beans generator already. I do wonder why
anyone would use the non-beans generator at all, though, so you might
make the argument that the two paths should be unified.
- Removal of the name when writing the beginning of the structure
as nobody seems to need this (https://issues.apache.org/jira/browse/
THRIFT-8). There is no need to hang on to obsolete constructs and
confuse new users.
This is part of the protocol interface, correct? Are you suggesting
that we remove struct names from all of thrift, or just from the Java
API? If it's unused, I'm pro-removal everywhere. However, we do have
to decide if that is a limitation we want to have.
- Replace base classes with interfaces so implementations that want
to do things differently do not have to extend the base class
awkwardly. For example: TProtocol and TServer.
+1
- Refactoring of strange constructs in the current API. For
example: the 'getTransport(TTransport)' method of TTransportFactory
is used as a transport transformer in addition to being used as a
factory. When used as a factory method the parameter makes no sense.
+1
- Remove the Client and Processor classes from generated code of
services so server and client implementations have more freedom
concerning how (service and) function selection information is
communicated and processed, and how I/O is handled. See below.
Are you sure we should stop generating the code altogether? For most
default cases, I doubt people are going to want to reimplement these
pieces. I'd be for making the Processor interface clearer and easier
to implement outside of code generation, though.
- Drop transports. See below.
I sort of like the transport abstraction. Not every use of transports
is hidden behind a server, after all - Rapleaf does a lot of
serializing structures to byte arrays and such. For the case of a
transport and a server being intimately intertwined, I would say that
there's nothing stopping you from not taking a Transport in the
constructor and doing whatever you want internally. The only point at
which you need to use a Transport is as soon as you want to rely on
the generated code (de)serialization, and at that point you can use
the memory based transports. I did something analogous in the
TNonblockingServer.
- Add an interface to the generated code and add support classes
for doing asynchronous calls (i.e. support for sequence IDs). See
below.
- Drop the many constructors of a number of classes and replace
them with a single constructor taking an options object.
I see where you're going with this, but doesn't this just mean that
you now have to manually code the validation of order and presence of
certain parameters in certain situations? Certainly the existing
scenario of tons of constructors is overly verbose, but at least it
avoids the bugs that would show up if we tried to do it ourself.
- Drop the 'T' prefix of types as this is not customary in Java.
I'm mostly pro on this one, though there are a few situations that
would yield name clashes if we did this to every class (TException ->
Exception, for one).
Removal of Client and Processor
===============================
It is impossible for the client and server implementations to
handle services in a generic way currently. The only types
available are TProcessorFactory and TProcessor. These types do not
provide enough information. For example: for some server
implementations it is interesting to know whether or not a function
has the 'async' keyword. If a function does have it, the
communication channel can be released immediatly instead of being
held onto while the asynchronous function is running.
In addition, the assumption that the protocol can be assigned to
the classes Client and Processor once, makes it more difficult to
implement multiplexing servers (which also need to send the service
name in addition to the function name), clients and servers using
non-blocking I/O, connection pooling clients, asynchronous calls,
etc. It would be much better if client and server implementations
have more control over how they want to handle things before
passing control to the function implementations.
By only supplying the minimal information about services and
functions to client and server implementations, and by passing the
protocol as late as possible to function implementations, lots of
flexibility is gained. Here is a rough draft of how the API would
look like (from the perspective of the server. Information and
methods needed for client implementations are missing):
class Service {
Function get(String);
}
class Function {
boolean isAsynchronous();
// Throws an exception if function is asynchronous.
//
// Only reads the structure containing the parameters
// using the protocol. Messages are handled by the
// server.
//
// Returns the wrapper around the result.
//
// Server can decide when to write the result without
// having to copy it to a byte array.
Base invoke(Protocol);
// Discards result if function is synchronous.
//
// Only reads the structure containing the parameters
// using the protocol. Messages are handled by the
// server.
void invoke(Protocol);
}
Another example that could be built is a client that requests the
mapping of (services and) functions to IDs from the server, so
invocation of a particular function does not require the sending of
the (service and) function names. Instead a few bytes are all that
is needed to select a function. This is very difficult to do now
because the Client and Processor classes have control over the
information of the message being sent.
For the multiplexing server that I want to build having more
freedom could mean replacing this request header:
i32: version and type
string: service name
i32: sequence ID (unused)
i32: version and type
string: function name
i32: sequence ID
With this request header (if I implement the function to ID
mapping). This saves (for a 15-character ASCII service name, a 10-
character ASCII function name and less than 16384 functions) 42-43
bytes per function invocation:
byte: version and type
vint: function ID (1 or 2 bytes)
i32: sequence ID
If I would use one bit of the version and type byte as a flag to
indicate sequence IDs are not needed, another 4 bytes could be
saved, i.e. only 2-3 bytes would be needed to select a function.
Note that protocols must understand more header types than
'TMessage' for this to work. Maybe the clients and servers must be
given the freedom to write the header instead of having a single
message type. Not all clients and servers can have their own header
format because that would defeat interoperability, instead a number
of header formats have to be agreed upon. For example:
- Current header format for single-service clients and servers.
- New header format for multiplexed clients and servers with
support for:
- a function to ID map for more efficient function selection;
- indicating a sequence ID is not needed.
Removal of the Client and Processor classes will also remove a lot
of duplicate code for handling calls from the generated classes. In
my opinion this duplicate code can easily be transformed into
generic code in client and server implementations.
Drop transports
===============
All servers currently require a transport, but the internal
workings of a server are closely tied to a specific transport. Why
have servers work with wrappers around sockets instead of working
with sockets directly? Possibilities for server implementation will
now be limited by the wrappers instead of allowing the full
flexibility of the socket API.
In my opinion a transport is internal to a server. If a server can
be implemented in a better way, the thin transport wrapper around
sockets should not be a limiting factor. It would be very
cumbersome if an API change of a 'TServerTransport' implementation
is needed to be able to implement a server differently.
The only thing that needs to be handled is 'TFramedTransport'
because this is a decorator. In my opinion it may be better to
implement it as a protocol decorator which frames messages.
I don't know how this works out for clients but I think it would be
very similar.
Asynchronous interface
======================
(Does anyone have a better name than 'asynchronous functions' so
the 'async' keyword and sequence IDs cannot be confused?)
The addition of an asynchronous interface (to support sequence IDs)
besides the synchronous one makes it easier to implement
asynchronous calls (using sequence IDs). Asynchronous calls need to
be handled differently by a client anyway, so it is best to make
this distinction clear. For some languages only asynchronous
interfaces might be generated. Here is a rough draft of how this
would look if polling is used to retrieve the response:
interface Calculator.IfaceAsync {
AsynchronousCall ping();
IntAsynchronousCall add(int, int);
IntAsynchronousCall calculate(int, Work);
// zip() not added because it does not return a result
}
interface AsynchronousCall {
boolean hasResponseArrived();
}
interface IntAsynchronousCall extends AsynchronousCall {
int getResult();
}
interface StructAsynchronousCall<R> extends AsynchronousCall {
R getResult();
}
An alternative is to use events instead of polling:
interface Calculator.IfaceAsync {
void ping(VoidAsynchronousCompletion);
void add(int, int, IntAsynchronousCompletion);
void calculate(int, Work, IntAsynchronousCompletion);
// zip() not added because it does not return a result
}
interface VoidAsynchronousCompletion {
void handleResponse();
}
interface IntAsynchronousCompletion {
void handleResponse(int);
}
interface StructAsynchronousCompletion<R> {
void handleResponse(R);
}
What do you think? Would a redesign be useful and worth it?
--
Kind regards,
Johan Stuyts