I have closed all 3 issues (NUTCH-1089, NUTCH-990 and NUTCH-1112) as they
were all duplicates and have been addressed in NUTCH-1089.

Thanks

Julien

On 9 November 2011 18:44, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> Evidently, there is some work to be done here. The good thing is that there
> is a comprehensive amount of correspondence on all of the issues quoted
> above. Maybe we can clean the Jira up a bit.
>
> On Wed, Nov 9, 2011 at 1:42 AM, Markus Jelsma <[email protected]> wrote:
>
> > Thanks!
> >
> > > > Before opening a ticket, I searched the JIRA and found that this is
> > > > the same problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112.
> > > > I just found it via different symptoms. NUTCH-1089 offers a patch.
> > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Tuesday, 8 November 2011 7:19 PM
> > > > To: [email protected]
> > > > Subject: Re: A bug has been fixed in protocol-httpclient
> > > >
> > > > Hi,
> > > >
> > > > > Can you open a Jira ticket and attach a patch file so we can track it?
> > > >
> > > > Thanks
> > > >
> > > > > Hi guys,
> > > > >
> > > > > > I know that protocol-httpclient is not recommended to use because
> > > > > > of known problems, but I don't have much choice because I need
> > > > > > authentication support, as a few other people do as well, I am sure.
> > > > >
> > > > > > I've reported a problem with too aggressive de-duplication
> > > > > > recently. On the example that I had, I traced that problem to an
> > > > > > empty content field. Digging further, I found this in
> > > > > > httpclient/HttpResponse.java (lines 126-130):
> > > > > >
> > > > >         while ((bufferFilled = in.read(buffer, 0, buffer.length))
> !=
> > > >
> > > > -1
> > > >
> > > > >             && totalRead + bufferFilled < contentLength) {
> > > > >
> > > > >           totalRead += bufferFilled;
> > > > >           out.write(buffer, 0, bufferFilled);
> > > > >
> > > > >         }
> > > > >
> > > > > This should be changed to
> > > > >
> > > > >         while ( ( bufferFilled = in.read( buffer, 0, buffer.length
> )
> > > >
> > > > ) !=
> > > >
> > > > > -1 ) {
> > > > >
> > > > >           int toWrite = totalRead + bufferFilled < contentLength ?
> > > > >
> > > > >                                                 totalRead +
> > > >
> > > > bufferFilled :
> > > > > contentLength - totalRead ; totalRead += bufferFilled;
> > > > >
> > > > >           out.write( buffer, 0, toWrite ) ;
> > > > >           if ( totalRead >= contentLength ) break ;
> > > > >
> > > > >         }
> > > > >
> > > > > > Otherwise the last portion read is quite often not stored.
> > > > > > Obviously, this is causing problems, especially in small documents
> > > > > > where the last portion read is the only one, and in PDF documents,
> > > > > > as well as other document types that are sensitive to truncation.
> > > > >
> > > > > > This problem explains a large part of false de-duplication cases,
> > > > > > as well as parsing errors with truncated content symptoms, but it
> > > > > > does not seem to explain all of them.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Arkadi
> >
>
>
>
> --
> *Lewis*
>
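
For reference, the read loop discussed in the quoted message can be sketched as a small standalone program. This is only an illustration, not the patch attached to NUTCH-1089: the class name BoundedReadSketch, the method readUpTo and the bufferSize parameter are hypothetical, and contentLength is assumed to be an upper bound on the number of bytes to keep. The sketch writes each chunk in full, or only the part that still fits under the limit, so the final chunk is not dropped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BoundedReadSketch {

  /**
   * Hypothetical helper: copies at most contentLength bytes from in,
   * including the final (possibly partial) chunk.
   */
  static byte[] readUpTo(InputStream in, int contentLength, int bufferSize)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[bufferSize];
    int totalRead = 0;
    int bufferFilled;
    // Stop once the limit is reached or the stream is exhausted.
    while (totalRead < contentLength
        && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
      // Keep the whole chunk, or only the part that fits under the limit.
      int toWrite = Math.min(bufferFilled, contentLength - totalRead);
      out.write(buffer, 0, toWrite);
      totalRead += toWrite;
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
    // The limit equals the document length: the last chunk must be kept.
    byte[] exact = readUpTo(new ByteArrayInputStream(data), data.length, 4);
    // The limit is shorter than the document: the content is cut at the limit.
    byte[] cut = readUpTo(new ByteArrayInputStream(data), 5, 4);
    System.out.println(new String(exact, StandardCharsets.US_ASCII));
    System.out.println(new String(cut, StandardCharsets.US_ASCII));
  }
}
```

The case where the limit equals the document length is the one that matters here: a loop that requires totalRead + bufferFilled < contentLength before writing would skip the last chunk in exactly that situation.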



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
