Hi,

If you have a look at your regex-urlfilter.txt, you will see that by default it rejects any URL containing a ? (the stock rule is -[?*!@=]). Please test with that line edited (or commented out) and see if the problem goes away.
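To illustrate what that rule does, here is a rough Python sketch of how I understand the filter to evaluate regex-urlfilter.txt: rules are tried top to bottom, the first pattern found anywhere in the URL decides, "+" accepts, "-" rejects, and a URL that matches no rule is dropped. The helper names (parse_rules, accepts) are made up for this example and are not part of Nutch, and the rule files are trimmed to the two relevant lines, so treat it as an approximation rather than the actual implementation.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First character is the sign, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First rule whose pattern is found in the URL wins (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is filtered out

# Trimmed-down stand-in for the default regex-urlfilter.txt.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# Same file with the query-character rule commented out.
EDITED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# accept anything else
+.
"""

URL = "http://www.uu.se/news/news_item.php?typ=pm&id=1381"

print(accepts(parse_rules(DEFAULT_RULES), URL))  # False
print(accepts(parse_rules(EDITED_RULES), URL))   # True
```

With the rule active, every news_item.php?... seed would be filtered, while http://www.uu.se/ passes; with it commented out, all of them pass.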
On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:

> Hi Markus!
>
> We are using a custom parser, but I don't think that the problem is in the
> parsing. I got the same problem when trying the ParserChecker. I also tried
> the following:
>
> I injected the following seeds:
>
> http://www.uu.se/news/news_item.php?id=1423&typ=pm
> http://www.uu.se/news/news_item.php?id=1421&typ=pm
> http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> http://www.uu.se/news/news_item.php?id=1407&typ=pm
> http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
> http://www.uu.se/
>
> Then generated a segment, fetched that segment and then did a readseg with
> -noparse, -noparsedata and -noparsetext.
>
> I have attached the readseg dump and it shows no content for:
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
>
> Can the problem somehow be in the configurations for the fetcher?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
>
> 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
>
>> What parser are you using? What does bin/nutch
>> org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine
>> with parse-tika enabled.
>>
>> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
>> > Hi!
>> >
>> > We are using Nutch to crawl a bunch of websites and index them to Solr. At
>> > the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
>> > and in the same time going from one server to two servers.
>> >
>> > Unfortunately we are stuck with a problem which we haven't seen in the old
>> > environment. Several of the pages that we are fetching contain no content
>> > when they are stored in the segment. The following is an excerpt from
>> > "readseg" on a segment containing such a page:
>> >
>> > ----
>> >
>> > Recno:: 5
>> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> >
>> > Content::
>> > Version: -1
>> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > contentType: text/html
>> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
>> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
>> > Connection=close Content-Type=text/html Server=Apache
>> > Content:
>> >
>> > ----
>> >
>> > The fetch logs say nothing unusual about retrieving this page:
>> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
>> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> >
>> > There seems to be nothing strange about the page itself and a very similar
>> > page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
>> > indexed without any problems.
>> >
>> > Anyone have any ideas about what might be wrong here?
>> >
>> >
>> > Best regards,
>> > --Anders Rask
>> > www.findwise.com
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
>>
>
>

--
*Lewis*