protocol-httpclient is broken and needs replacing

On 19 July 2011 23:10, Anders Rask <anr...@gmail.com> wrote:
> Hi guys!
>
> I experimented some more, and it seems I'm only getting these problems when
> using protocol-httpclient. It works fine when I use protocol-http.
>
> Could you please try and see if you get the same behavior?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
> 2011/7/18 Anders Rask <anr...@gmail.com>
>
> > Thank you for your quick responses!
> >
> > Our custom parser runs an embedded OpenPipeline
> > (http://www.openpipeline.org/), which in turn runs a Tika parser to parse
> > the content.
> >
> > I have tried running inject, generate, fetch, readseg with standard Nutch
> > now and it works fine with that page. So if the problem is in our custom
> > parser, then how is the parser involved in the fetch command?
> >
> > Markus, you mentioned changes to the HTML parse API between version 1.1 and
> > 1.2. I checked the CHANGES.txt file but couldn't find anything about this.
> > Do you have more information?
> >
> >
> > Best regards,
> > --Anders Rask
> > www.findwise.com
> >
> >
> > 2011/7/18 Julien Nioche <lists.digitalpeb...@gmail.com>
> >
> >> As pointed out by Markus the logs show that the content has been properly
> >> fetched. Moreover
> >>
> >> > ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'
> >>
> >> works fine. Double check your custom parser, it is likely to be the source
> >> of the problem.
> >>
> >> BTW: what does your custom parser do? Is it a HtmlParseFilter? If so, which
> >> parser are you using for HTML - parse-html or parse-tika?
> >>
> >> Julien
> >>
> >>
> >> On 18 July 2011 10:46, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>
> >> > Judging from the segment those URLs are fetched and parsed. I think maybe
> >> > some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
> >> > parserchecker shows the same issue then it's most likely a parse plugin
> >> > problem for the new version. Can you check?
> >> >
> >> > > Hi,
> >> > >
> >> > > If you have a look at your regex-urlfilter.txt it will by default be
> >> > > rejecting ? in the URL. Please test with that line edited (or commented
> >> > > out) and see if the problem fades.
> >> > >
> >> > > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:
> >> > > > Hi Markus!
> >> > > >
> >> > > > We are using a custom parser, but I don't think that the problem is in
> >> > > > the parsing. I got the same problem when trying the ParserChecker. I
> >> > > > also tried the following:
> >> > > >
> >> > > > I injected the following seeds:
> >> > > >
> >> > > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > > http://www.uu.se/
> >> > > >
> >> > > > Then generated a segment, fetched that segment and then did a readseg
> >> > > > with -noparse, -noparsedata and -noparsetext.
> >> > > >
> >> > > > I have attached the readseg dump and it shows no content for:
> >> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >
> >> > > > Can the problem somehow be in the configurations for the fetcher?
> >> > > >
> >> > > >
> >> > > > Best regards,
> >> > > > --Anders Rask
> >> > > > www.findwise.com
> >> > > >
> >> > > >
> >> > > > 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
> >> > > >
> >> > > >> What parser are you using? What does bin/nutch
> >> > > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> >> > > >> fine with parse-tika enabled.
> >> > > >>
> >> > > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> >> > > >> > Hi!
> >> > > >> >
> >> > > >> > We are using Nutch to crawl a bunch of websites and index them to
> >> > > >> > Solr. At the moment we are in the process of upgrading from Nutch
> >> > > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
> >> > > >> > servers.
> >> > > >> >
> >> > > >> > Unfortunately we are stuck with a problem which we haven't seen in
> >> > > >> > the old environment. Several of the pages that we are fetching
> >> > > >> > contain no content when they are stored in the segment. The
> >> > > >> > following is an excerpt from "readseg" on a segment containing such
> >> > > >> > a page:
> >> > > >> >
> >> > > >> > ----
> >> > > >> >
> >> > > >> > Recno:: 5
> >> > > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> >
> >> > > >> > Content::
> >> > > >> > Version: -1
> >> > > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> > contentType: text/html
> >> > > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> >> > > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> >> > > >> > Connection=close Content-Type=text/html Server=Apache
> >> > > >> > Content:
> >> > > >> >
> >> > > >> > ----
> >> > > >> >
> >> > > >> > The fetch logs say nothing unusual about retrieving this page:
> >> > > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> >
> >> > > >> > There seems to be nothing strange about the page itself and a very
> >> > > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> >> > > >> > crawled and indexed without any problems.
> >> > > >> >
> >> > > >> > Anyone have any ideas about what might be wrong here?
> >> > > >> >
> >> > > >> >
> >> > > >> > Best regards,
> >> > > >> > --Anders Rask
> >> > > >> > www.findwise.com
> >> > > >>
> >> > > >> --
> >> > > >> Markus Jelsma - CTO - Openindex
> >> > > >> http://www.linkedin.com/in/markus17
> >> > > >> 050-8536620 / 06-50258350
> >>
> >> --
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
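
For anyone who wants to rule out the URL filter suggested earlier in the thread before looking at the protocol plugin, here is a minimal Python sketch of how the regex URL filter treats a URL. It assumes first-match-wins semantics (the first rule whose pattern is found anywhere in the URL decides, and a URL matching no rule is rejected) and uses an abbreviated rule list modeled on the stock conf/regex-urlfilter.txt. The names DEFAULT_RULES, compile_rules and accepts are made up for this illustration and are not part of Nutch, and the rules on your own install may differ. In this thread the page did appear in the fetch log, so the filter was probably not the cause here; the sketch only shows what the suggestion was checking for.

```python
import re

# Abbreviated rule list modeled on the stock conf/regex-urlfilter.txt.
# Assumption: the file on a given install may contain different rules.
DEFAULT_RULES = [
    "-^(file|ftp|mailto):",  # skip file:, ftp: and mailto: URLs
    "-[?*!@=]",              # skip URLs containing characters typical of queries
    "+.",                    # accept anything else
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first matching rule accepts the URL.

    A URL that matches no rule at all is treated as rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):  # partial match anywhere in the URL
            return accept
    return False


if __name__ == "__main__":
    rules = compile_rules(DEFAULT_RULES)
    for url in (
        "http://www.uu.se/",
        "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
    ):
        print(url, "accepted" if accepts(url, rules) else "rejected")
```

With the rules above, the seed containing ? and = is rejected while http://www.uu.se/ is accepted. Dropping the "-[?*!@=]" entry from the list corresponds to commenting that line out in regex-urlfilter.txt.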