Before opening a ticket, I searched JIRA and found that this is the same
problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112. I just found it
via different symptoms. NUTCH-1089 offers a patch.
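
For anyone following the thread, here is a minimal, self-contained sketch of
what the read loop is meant to do: copy at most contentLength bytes and keep
the final chunk even when it reaches or crosses the limit. It is not the
NUTCH-1089 patch and not the actual HttpResponse code; the class name
BoundedCopy, the method copyUpTo and the bufferSize parameter are made up for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
class BoundedCopy {

    /**
     * Copies at most contentLength bytes from in to out and returns the
     * number of bytes written. bufferSize is assumed to be positive.
     */
    public static int copyUpTo(InputStream in, OutputStream out,
                               int contentLength, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int totalRead = 0;
        int bufferFilled;
        // Stop at end of stream or once the limit has been reached.
        while (totalRead < contentLength
                && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            // Write the whole chunk, or only the part that still fits under the limit.
            int toWrite = Math.min(bufferFilled, contentLength - totalRead);
            out.write(buffer, 0, toWrite);
            totalRead += toWrite;
        }
        return totalRead;
    }

    public static void main(String[] args) throws IOException {
        // A small document that fits in a single read, the case the old loop dropped.
        byte[] body = "small document".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int stored = copyUpTo(new ByteArrayInputStream(body), out, body.length, 4096);
        System.out.println(stored + " of " + body.length + " bytes stored");
    }
}
```

With the loop quoted below from HttpResponse.java, the same input stores
nothing: the first read returns the whole body, totalRead + bufferFilled equals
contentLength, and the loop exits before the write.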

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, 8 November 2011 7:19 PM
> To: [email protected]
> Subject: Re: A bug has been fixed in protocol-httpclient
> 
> Hi,
> 
> Can you open a Jira ticket and attach a patch file so we can track it?
> 
> Thanks
> 
> > Hi guys,
> >
> > I know that protocol-httpclient is not recommended because of known
> > problems, but I don't have much choice because I need authentication
> > support, as a few other people do as well, I am sure.
> >
> > I recently reported a problem with overly aggressive de-duplication. In the
> > example that I had, I traced that problem to an empty content field.
> > Digging further, I found this in httpclient/HttpResponse.java (lines
> > 126-130):
> >
> >         while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
> >             && totalRead + bufferFilled < contentLength) {
> >           totalRead += bufferFilled;
> >           out.write(buffer, 0, bufferFilled);
> >         }
> >
> > This should be changed to
> >
> >         while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 ) {
> >           int toWrite = totalRead + bufferFilled < contentLength ?
> >                             bufferFilled : contentLength - totalRead ;
> >           totalRead += bufferFilled ;
> >           out.write( buffer, 0, toWrite ) ;
> >           if ( totalRead >= contentLength ) break ;
> >         }
> >
> > Otherwise the last chunk read is quite often not stored. Obviously, this is
> > causing problems, especially in small documents where the last chunk
> > is the only one, and in PDF documents and other document
> > types that are sensitive to truncation.
> >
> > This problem explains a large part of false de-duplication cases, as well
> > as parsing errors with truncated content symptoms, but it does not seem to
> > explain all of them.
> >
> > Regards,
> >
> > Arkadi
