I have closed all 3 issues (NUTCH-1089, NUTCH-990 and NUTCH-1112), as they were all duplicates and have been addressed in NUTCH-1089.
Thanks

Julien

On 9 November 2011 18:44, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> Evidently, there is some work to be done here. The good thing is that there
> is a comprehensive amount of correspondence on all of the issues quoted
> above. Maybe we can clean the Jira up a bit.
>
> On Wed, Nov 9, 2011 at 1:42 AM, Markus Jelsma <[email protected]> wrote:
>
> > Thanks!
> >
> > > Before opening a ticket, searched the JIRA and found that this is the
> > > same problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112. I just
> > > found it via different symptoms. NUTCH-1089 offers a patch.
> > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Tuesday, 8 November 2011 7:19 PM
> > > > To: [email protected]
> > > > Subject: Re: A bug has been fixed in protocol-httpclient
> > > >
> > > > Hi,
> > > >
> > > > Can you open a Jira ticket and attach a patch file so we can track it?
> > > >
> > > > Thanks
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I know that protocol-httpclient is not recommended to use because of
> > > > > known problems, but I don't have much choice because I need
> > > > > authentication support, as a few other people do as well, I am sure.
> > > > >
> > > > > I've reported a problem with too aggressive de-duplication recently.
> > > > > On the example that I had, I traced that problem to an empty content
> > > > > field. Digging further, I found this in httpclient/HttpResponse.java
> > > > > (lines 126-130):
> > > > >
> > > > >     while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
> > > > >         && totalRead + bufferFilled < contentLength) {
> > > > >       totalRead += bufferFilled;
> > > > >       out.write(buffer, 0, bufferFilled);
> > > > >     }
> > > > >
> > > > > This should be changed to
> > > > >
> > > > >     while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 ) {
> > > > >       int toWrite = totalRead + bufferFilled < contentLength ?
> > > > >           totalRead + bufferFilled : contentLength - totalRead ;
> > > > >       totalRead += bufferFilled;
> > > > >       out.write( buffer, 0, toWrite ) ;
> > > > >       if ( totalRead >= contentLength ) break ;
> > > > >     }
> > > > >
> > > > > Else the last read portion quite often is not stored. Obviously, this
> > > > > is causing problems, especially in small documents where the last read
> > > > > portion is the only one, and in PDF documents, as well as other
> > > > > document types that are sensitive to truncation.
> > > > >
> > > > > This problem explains a large part of false de-duplication cases, as
> > > > > well as parsing errors with truncated content symptoms, but it does not
> > > > > seem to explain all of them.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Arkadi
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
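For reference, the read loop discussed in the quoted message can be sketched as a standalone method. This is only a minimal sketch under stated assumptions, not the patch attached to NUTCH-1089: the class name `BoundedRead`, the method `readUpTo` and the `bufferSize` parameter are hypothetical, and `contentLength` is treated as the maximum number of bytes to keep. It writes at most `bufferFilled` bytes per iteration; the quoted proposal computes `toWrite` as `totalRead + bufferFilled` in the non-truncating case, which is more than was just read on any iteration after the first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Nutch.
class BoundedRead {

    // Reads from 'in' until end of stream or until 'contentLength' bytes
    // have been kept, whichever comes first. 'bufferSize' must be positive.
    static byte[] readUpTo(InputStream in, int contentLength, int bufferSize)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        int totalRead = 0;
        int bufferFilled;
        while (totalRead < contentLength
                && (bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            // Keep the whole chunk unless it would cross the limit; in that
            // case keep only the part that still fits.
            int toWrite = Math.min(bufferFilled, contentLength - totalRead);
            out.write(buffer, 0, toWrite);
            totalRead += toWrite;
        }
        return out.toByteArray();
    }
}
```

With this shape, a document smaller than the limit is stored whole even when it arrives in a single read, which is the case the original loop dropped, and a larger document is cut at exactly `contentLength` bytes.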

