RE: TCP socket InputSource

Michael Wojcik Wed, 24 Apr 2002 10:12:48 -0700

> From: Itay Eliaz [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, April 24, 2002 12:03 PM


> I'm trying to implement an InputSource based on a TCP socket for the
> XML parser.
> I do this since my connection is not HTTP, and therefore I can't use the
> URLInputSource.
> To do this I derived from the InputSource and BinInputStream classes.
> My problem is the document ends but the socket is still open and the
> parser hangs.
> In my implementation of the BinInputStream::readBytes method, the whole
> document is read, but in the next method call it hangs since it didn't
> reach maxToRead nor EOF.

(I think I understand your problem, but if not feel free to correct me
and/or ignore this message entirely.)

This is more a TCP issue than a Xerces one.  TCP doesn't make any guarantees
about record boundaries - it's a connection-oriented octet-stream protocol.
That means you have no guarantee that one call to recv() will read all the
data written to the socket by the peer.

The sending program must communicate the end of its transmission in some
fashion.  The usual approaches are:

1. An application protocol that delimits the application data.  This can be
as simple as sending a size value immediately before the data (making sure,
of course, to send it in some canonical form over the wire and translate it
on the receiving end into something you can use).  Or it can be a flexible,
powerful protocol with features to handle many kinds of conditions and room
for expansion.  Like, say, HTTP.

2. Data that's self-delimiting, with some kind of sentinel value at the end.
Good only if you can reserve an octet value for the sentinel, and even then
it lacks the main advantage of the first method: a length prefix tells the
receiver up front how much to read, which simplifies buffer handling in
languages that don't offer automatic buffer management.

3. A TCP half-close: the sending side sends a TCP FIN, indicating that it
will not be sending any more data.  The sockets API lets you do this with
the shutdown() function.  See pretty much any reference on programming with
sockets for more information.

4. Send record-boundary information on another channel.  There's no
advantage here for new applications; it's a kluge to get around unavoidable
design problems with existing code.

5. Consider the transmission over when a timer expires without new data
being received.  This is only appropriate in limited circumstances, usually
as part of error recovery.
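
As a rough sketch of the simplest form of method 1, the sender could prefix
each document with its size as a 4-octet integer in network byte order.  The
function names and the 4-octet width are assumptions of mine, not part of
any existing protocol.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>
#include <string>

// Sender: prefix `payload` with its length as a 4-octet big-endian integer
// (the "canonical form over the wire").
std::string frameMessage(const std::string& payload)
{
    std::uint32_t wireLen = htonl(static_cast<std::uint32_t>(payload.size()));
    std::string out(reinterpret_cast<const char*>(&wireLen), sizeof wireLen);
    out += payload;
    return out;
}

// Receiver: translate the first four octets of a frame back into a
// host-order length.
std::uint32_t readFrameLength(const char* header)
{
    std::uint32_t wireLen;
    std::memcpy(&wireLen, header, sizeof wireLen);
    return ntohl(wireLen);
}
```

The receiver reads four octets, decodes the length, and then reads exactly
that many octets of document data before reporting EOF to the parser.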

If you use method 3 (the half-close), subsequent calls to recv() will return
0 (the socket equivalent of EOF).  Otherwise, when your readBytes detects
the end of the transmission from the peer, it will have to record that fact
somewhere, so that subsequent calls to readBytes can simulate EOF.  (That's
assuming I understand the semantics of readBytes; I haven't looked at that
code.)
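
Here is a minimal sketch of both sides of method 3.  SocketByteStream is a
hypothetical stand-in for your BinInputStream subclass, not the real Xerces
class, and its readBytes signature is only an approximation.  It assumes the
parser accepts a short read, so it returns whatever one recv() delivers
rather than waiting to fill maxToRead, and it remembers EOF once seen.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstddef>

// Sender side: send a FIN after the last octet, keeping the read side open
// so a response can still be received.
inline bool finishSending(int fd)
{
    return shutdown(fd, SHUT_WR) == 0;
}

// Receiver side: hypothetical stand-in for a BinInputStream subclass.
class SocketByteStream {
public:
    explicit SocketByteStream(int fd) : fd_(fd), eof_(false), pos_(0) {}

    // Returns the number of octets stored in toFill; 0 means end of input.
    std::size_t readBytes(unsigned char* toFill, std::size_t maxToRead)
    {
        if (eof_ || maxToRead == 0)
            return 0;               // EOF already recorded: never block again
        ssize_t n;
        do {
            n = recv(fd_, toFill, maxToRead, 0);
        } while (n < 0 && errno == EINTR);
        if (n <= 0) {
            // 0 means the peer sent FIN.  Errors are treated as end of
            // input here for brevity; real code should report them.
            eof_ = true;
            return 0;
        }
        pos_ += static_cast<std::size_t>(n);
        return static_cast<std::size_t>(n);
    }

    std::size_t curPos() const { return pos_; }

private:
    int fd_;
    bool eof_;
    std::size_t pos_;
};
```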

Now, since XML documents are self-delimiting (assuming you only send a
single document at a time), you *could* use the document itself to implement
method 2, but that would in effect require readBytes to do some parsing
itself - definitely a backwards and error-prone approach.

I'd say 1 or 3 is the way to go.  Personally, I'd lean toward 1, as the most
powerful, and have the sender implement a very basic HTTP/1.0 user agent.
(You don't want HTTP/1.1, as that would require supporting the Chunked
transfer-encoding and various other features excessive to your
requirements.)  But 3 is simple and elegant, if the sender only wants to
send one document, get a response, and close the connection.  Just make sure
you handle a return value of 0 from recv() correctly in readBytes.
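
For the HTTP/1.0 route, the sender's side could be as small as the sketch
below, which builds a POST request whose Content-Length header delimits the
document.  The function name, the request path, and the text/xml content
type are illustrative assumptions; the receiver would still need to parse
the headers before handing the body to the parser.

```cpp
#include <string>

// Hypothetical helper: build an HTTP/1.0 POST carrying one XML document.
// Content-Length tells the receiver where the document ends.
std::string buildHttp10Post(const std::string& path, const std::string& xmlBody)
{
    std::string req = "POST " + path + " HTTP/1.0\r\n";
    req += "Content-Type: text/xml\r\n";
    req += "Content-Length: " + std::to_string(xmlBody.size()) + "\r\n";
    req += "\r\n";              // blank line ends the headers
    req += xmlBody;
    return req;
}
```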

Michael Wojcik
Principal Software Systems Developer, Micro Focus
Department of English, Miami University

