Thanks!
> Before opening a ticket, searched the JIRA and found that this is the same
> problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112. I just found
> it via different symptoms. NUTCH-1089 offers a patch.
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Tuesday, 8 November 2011 7:19 PM
> > To: [email protected]
> > Subject: Re: A bug has been fixed in protocol-httpclient
> >
> > Hi,
> >
> > Can you open a Jira ticket and attach a patch file so we can track it?
> >
> > Thanks
> >
> > > Hi guys,
> > >
> > > I know that protocol-httpclient is not recommended to use because of
> > > known problems, but I don't have much choice because I need
> > > authentication support, as a few other people do as well, I am sure.
> > >
> > > I've reported a problem with too aggressive de-duplication recently.
> > > On the example that I had, I traced that problem to an empty content
> > > field. Digging further, I found this in httpclient/HttpResponse.java
> > > (lines 126-130):
> > >
> > >     while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
> > >         && totalRead + bufferFilled < contentLength) {
> > >       totalRead += bufferFilled;
> > >       out.write(buffer, 0, bufferFilled);
> > >     }
> > >
> > > This should be changed to
> > >
> > >     while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 ) {
> > >       int toWrite = totalRead + bufferFilled < contentLength ?
> > >           totalRead + bufferFilled : contentLength - totalRead ;
> > >       totalRead += bufferFilled;
> > >       out.write( buffer, 0, toWrite ) ;
> > >       if ( totalRead >= contentLength ) break ;
> > >     }
> > >
> > > Else the last read portion quite often is not stored. Obviously, this
> > > is causing problems, especially in small documents where the last read
> > > portion is the only one, and in PDF documents, as well as other
> > > document types that are sensitive to truncation.
> > >
> > > This problem explains a large part of false de-duplication cases, as
> > > well as parsing errors with truncated content symptoms, but it does not
> > > seem to explain all of them.
> > >
> > > Regards,
> > >
> > > Arkadi
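For anyone following the thread, here is a minimal, self-contained sketch of the bounded read loop being discussed. It is not the patch attached to NUTCH-1089, and it is not the actual Nutch code: the class name `BoundedRead`, the helper `readUpTo`, and the `bufferSize` parameter are made up for illustration. The variable names `buffer`, `bufferFilled`, `totalRead` and `contentLength` follow the quoted snippet. The idea is to keep the final chunk and cap each write at whichever is smaller: the bytes just read (`bufferFilled`) or the bytes still allowed (`contentLength - totalRead`).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BoundedRead {

    // Hypothetical helper: copies at most contentLength bytes from the stream,
    // including the final (possibly partial) chunk.
    public static byte[] readUpTo(InputStream in, int contentLength, int bufferSize)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        int totalRead = 0;
        int bufferFilled;
        // Stop once the limit is reached or the stream is exhausted.
        while (totalRead < contentLength
                && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            // Never write more than was just read, nor more than remains allowed.
            int toWrite = Math.min(bufferFilled, contentLength - totalRead);
            out.write(buffer, 0, toWrite);
            totalRead += toWrite;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "hello world".getBytes(StandardCharsets.US_ASCII);
        // Limit equal to the body length: the last chunk must not be dropped.
        byte[] exact = readUpTo(new ByteArrayInputStream(body), body.length, 4);
        // Limit smaller than the body: output is truncated to the limit.
        byte[] capped = readUpTo(new ByteArrayInputStream(body), 5, 4);
        System.out.println(new String(exact, StandardCharsets.US_ASCII));
        System.out.println(new String(capped, StandardCharsets.US_ASCII));
    }
}
```

With a 4-byte buffer, the first call exercises the case where the last read lands exactly on the limit, which is where the original loop condition would skip the write.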

