Hi guys! I experimented some more, and it seems I'm only getting these problems when using protocol-httpclient. It works fine when I use protocol-http.
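In case it helps with reproducing this: the switch I am describing is made in the plugin.includes property, normally overridden in conf/nutch-site.xml. The fragment below is only a sketch. Apart from the protocol entry, the plugin list is an assumed example and not our actual configuration, so keep whatever your own setup lists and change just the protocol plugin.

```xml
<!-- conf/nutch-site.xml (sketch; every plugin except the protocol entry is an assumed example) -->
<property>
  <name>plugin.includes</name>
  <!-- Case that shows the empty content for me: protocol-httpclient.
       Replace it with protocol-http for the case that works. -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

After changing the value, the same inject, generate, fetch, readseg sequence can be rerun on a fresh segment to compare the stored content.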
Could you please try and see if you get the same behavior?

Best regards,
--Anders Rask
www.findwise.com

2011/7/18 Anders Rask <anr...@gmail.com>

> Thank you for your quick responses!
>
> Our custom parser runs an embedded OpenPipeline
> (http://www.openpipeline.org/), which in turn runs a Tika parser to parse
> the content.
>
> I have tried running inject, generate, fetch, readseg with standard Nutch
> now and it works fine with that page. So if the problem is in our custom
> parser then how is the parser involved in the fetch command?
>
> Markus, you mentioned changes to HTML parse API between version 1.1 and
> 1.2. I checked the CHANGES.txt file but couldn't find anything about this,
> do you have more information?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
>
> 2011/7/18 Julien Nioche <lists.digitalpeb...@gmail.com>
>
>> As pointed out by Markus the logs show that the content has been properly
>> fetched. Moreover
>>
>> > ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'
>>
>> works fine. Double check your custom parser, it is likely to be the source
>> of the problem.
>>
>> BTW: what does your custom parser do? Is it a HtmlParseFilter? If so, which
>> parser are you using for HTML - parse-html or parse-tika?
>>
>> Julien
>>
>>
>> On 18 July 2011 10:46, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>> > Judging from the segment those URLs are fetched and parsed. I think maybe
>> > some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
>> > parserchecker shows the same issue then it's most likely a parse plugin
>> > problem for the new version. Can you check?
>> >
>> > > Hi,
>> > >
>> > > If you have a look at your regex-urlfilter.txt it will by default be
>> > > rejecting ? in the URL. Please test with line edited (or commented out)
>> > > and see if the problem fades.
>> > >
>> > > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:
>> > > > Hi Markus!
>> > > >
>> > > > We are using a custom parser, but I don't think that the problem is in
>> > > > the parsing. I got the same problem when trying the ParserChecker. I
>> > > > also tried the following:
>> > > >
>> > > > I injected the following seeds:
>> > > >
>> > > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > > http://www.uu.se/
>> > > >
>> > > > Then generated a segment, fetched that segment and then did a readseg
>> > > > with -noparse, -noparsedata and -noparsetext.
>> > > >
>> > > > I have attached the readseg dump and it shows no content for:
>> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >
>> > > > Can the problem somehow be in the configurations for the fetcher?
>> > > >
>> > > >
>> > > > Best regards,
>> > > > --Anders Rask
>> > > > www.findwise.com
>> > > >
>> > > >
>> > > > 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
>> > > >
>> > > >> What parser are you using? What does bin/nutch
>> > > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
>> > > >> fine with parse-tika enabled.
>> > > >>
>> > > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
>> > > >> > Hi!
>> > > >> >
>> > > >> > We are using Nutch to crawl a bunch of websites and index them to
>> > > >> > Solr. At the moment we are in the process of upgrading from Nutch
>> > > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
>> > > >> > servers.
>> > > >> >
>> > > >> > Unfortunately we are stuck with a problem which we haven't seen in
>> > > >> > the old environment. Several of the pages that we are fetching
>> > > >> > contain no content when they are stored in the segment. The
>> > > >> > following is an excerpt from "readseg" on a segment containing such
>> > > >> > a page:
>> > > >> >
>> > > >> > ----
>> > > >> >
>> > > >> > Recno:: 5
>> > > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> >
>> > > >> > Content::
>> > > >> > Version: -1
>> > > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> > contentType: text/html
>> > > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
>> > > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
>> > > >> > Connection=close Content-Type=text/html Server=Apache
>> > > >> > Content:
>> > > >> >
>> > > >> > ----
>> > > >> >
>> > > >> > The fetch logs say nothing unusual about retrieving this page:
>> > > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
>> > > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> >
>> > > >> > There seems to be nothing strange about the page itself and a very
>> > > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
>> > > >> > crawled and indexed without any problems.
>> > > >> >
>> > > >> > Anyone have any ideas about what might be wrong here?
>> > > >> >
>> > > >> >
>> > > >> > Best regards,
>> > > >> > --Anders Rask
>> > > >> > www.findwise.com
>> > > >>
>> > > >> --
>> > > >> Markus Jelsma - CTO - Openindex
>> > > >> http://www.linkedin.com/in/markus17
>> > > >> 050-8536620 / 06-50258350
>> >
>>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>