Before opening a ticket, I searched JIRA and found that this is the same problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112. I just came across it through different symptoms. NUTCH-1089 offers a patch.
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, 8 November 2011 7:19 PM
> To: [email protected]
> Subject: Re: A bug has been fixed in protocol-httpclient
>
> Hi,
>
> Can you open a Jira ticket and attach a patch file so we can track it?
>
> Thanks
>
> > Hi guys,
> >
> > I know that protocol-httpclient is not recommended to use because of
> > known problems, but I don't have much choice because I need
> > authentication support, as a few other people do as well, I am sure.
> >
> > I've reported a problem with too aggressive de-duplication recently.
> > On the example that I had, I traced that problem to an empty content
> > field. Digging further, I found this in httpclient/HttpResponse.java
> > (lines 126-130):
> >
> >     while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
> >            && totalRead + bufferFilled < contentLength) {
> >       totalRead += bufferFilled;
> >       out.write(buffer, 0, bufferFilled);
> >     }
> >
> > This should be changed to
> >
> >     while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 ) {
> >       int toWrite = totalRead + bufferFilled < contentLength ?
> >                     totalRead + bufferFilled : contentLength - totalRead ;
> >       totalRead += bufferFilled;
> >       out.write( buffer, 0, toWrite ) ;
> >       if ( totalRead >= contentLength ) break ;
> >     }
> >
> > Else the last read portion quite often is not stored. Obviously, this
> > is causing problems, especially in small documents where the last read
> > portion is the only one, and in PDF documents, as well as other
> > document types that are sensitive to truncation.
> >
> > This problem explains a large part of false de-duplication cases, as
> > well as parsing errors with truncated content symptoms, but it does
> > not seem to explain all of them.
> >
> > Regards,
> >
> > Arkadi
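For anyone following the thread, here is a minimal standalone sketch of the loop behaviour described in the quoted message. It is only an illustration: the class and method names (BoundedCopy, copyOld, copyBounded) are made up here, it is not Nutch code, and it is not the NUTCH-1089 patch. One deliberate difference from the proposal as quoted: the number of bytes written per chunk is min(bufferFilled, contentLength - totalRead), since using totalRead + bufferFilled in the first branch of the ternary would write more than was just read from the second chunk onwards.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration only; not part of Nutch.
class BoundedCopy {

    // Same loop shape as the original snippet quoted in the thread:
    // the chunk that reaches contentLength fails the "<" test and is never written.
    static byte[] copyOld(InputStream in, int contentLength, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int bufferFilled;
        int totalRead = 0;
        while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
                && totalRead + bufferFilled < contentLength) {
            totalRead += bufferFilled;
            out.write(buffer, 0, bufferFilled);
        }
        return out.toByteArray();
    }

    // Assumed fix: write every chunk, trimming only the part beyond contentLength.
    static byte[] copyBounded(InputStream in, int contentLength, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int bufferFilled;
        int totalRead = 0;
        while (totalRead < contentLength
                && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            // Never write more than was just read, nor more than the remaining quota.
            int toWrite = Math.min(bufferFilled, contentLength - totalRead);
            out.write(buffer, 0, toWrite);
            totalRead += toWrite;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "0123456789".getBytes(StandardCharsets.US_ASCII);
        // Small document, buffer larger than the document: the only chunk is dropped.
        System.out.println(copyOld(new ByteArrayInputStream(doc), doc.length, 16).length);     // prints 0
        // Buffer smaller than the document: the last chunk (2 bytes) is dropped.
        System.out.println(copyOld(new ByteArrayInputStream(doc), doc.length, 4).length);      // prints 8
        // Bounded copy keeps the whole document in both cases.
        System.out.println(copyBounded(new ByteArrayInputStream(doc), doc.length, 16).length); // prints 10
        System.out.println(copyBounded(new ByteArrayInputStream(doc), doc.length, 4).length);  // prints 10
    }
}
```

The first case is the one that matches the empty content field symptom: when the whole document arrives in a single read, the old condition is false on the first iteration and nothing is stored.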

