Hi guys,
I know that protocol-httpclient is not recommended because of known
problems, but I don't have much choice: I need authentication support,
as I am sure a few other people do as well.
I recently reported a problem with overly aggressive de-duplication. In the
example I had, I traced the problem to an empty content field. Digging
further, I found this in httpclient/HttpResponse.java (lines 126-130):
  while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
         && totalRead + bufferFilled < contentLength) {
    totalRead += bufferFilled;
    out.write(buffer, 0, bufferFilled);
  }
This should be changed to:
  while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 )
  {
    // Write the whole chunk if it fits, else only the bytes still missing.
    int toWrite = totalRead + bufferFilled < contentLength ?
                  bufferFilled :
                  contentLength - totalRead ;
    totalRead += bufferFilled;
    out.write( buffer, 0, toWrite ) ;
    if ( totalRead >= contentLength ) break ;
  }
Otherwise the last chunk read is often not stored: the original loop exits as
soon as totalRead + bufferFilled reaches contentLength, before that chunk is
written. This obviously causes problems, especially in small documents, where
the last chunk is the only one, and in PDFs and other document types that are
sensitive to truncation.
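For anyone who wants to try this outside Nutch, here is a minimal standalone
sketch of the same bounded copy. The class name BoundedCopy and the method
copyBounded are made up for illustration and are not part of the Nutch code;
it also assumes contentLength is known and bufferSize is greater than zero.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BoundedCopy {

    // Hypothetical helper, not Nutch code: copies at most contentLength
    // bytes from in to out and returns the number of bytes written.
    static int copyBounded(InputStream in, OutputStream out,
                           int contentLength, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int totalRead = 0;
        int bufferFilled;
        while (totalRead < contentLength
                && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            // Write the whole chunk if it fits, else only the bytes still missing.
            int toWrite = Math.min(bufferFilled, contentLength - totalRead);
            out.write(buffer, 0, toWrite);
            totalRead += toWrite;
        }
        return totalRead;
    }

    public static void main(String[] args) throws IOException {
        // A small document that fits in a single read: the case the
        // original loop dropped entirely.
        byte[] body = "hello, world".getBytes("US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int written = copyBounded(new ByteArrayInputStream(body), out,
                                  body.length, 4096);
        System.out.println(written + " bytes: " + out.toString("US-ASCII"));
    }
}
```

With a body shorter than the buffer, the first read is also the last one, so
this is the case where the original loop stored nothing at all.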
This problem explains a large part of the false de-duplication cases, as well
as parsing errors that show symptoms of truncated content, but it does not
seem to explain all of them.
Regards,
Arkadi