Thank you for your quick responses!

Our custom parser runs an embedded OpenPipeline (
http://www.openpipeline.org/), which in turn runs a Tika parser to parse the
content.

I have now run inject, generate, fetch and readseg with standard Nutch,
and it works fine for that page. So if the problem is in our custom
parser, how is the parser involved in the fetch command?
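
For reference, the sequence can be sketched roughly as below. This is only a
minimal sketch: the NUTCH, CRAWL_DIR, SEED_DIR and DUMP_DIR values and the
helper function names are hypothetical placeholders, not the paths or scripts
from our setup. Setting DRY_RUN prints the commands without running them.

```shell
#!/bin/sh
# Hypothetical defaults; none of these paths come from the thread.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SEED_DIR="${SEED_DIR:-urls}"
DUMP_DIR="${DUMP_DIR:-segdump}"

# Print each command; execute it unless DRY_RUN is set.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Newest segment directory (segment names are timestamps, so they sort by age).
latest_segment() {
    ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -n 1
}

fetch_and_dump() {
    run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR" || return 1
    run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1

    segment=$(latest_segment)
    if [ -z "$segment" ]; then
        if [ -z "${DRY_RUN:-}" ]; then
            echo "no segment found under $CRAWL_DIR/segments" >&2
            return 1
        fi
        # In a dry run nothing was generated, so use a placeholder name.
        segment="$CRAWL_DIR/segments/SEGMENT"
    fi

    run "$NUTCH" fetch "$segment" || return 1
    # Leave the parse-related parts out of the dump.
    run "$NUTCH" readseg -dump "$segment" "$DUMP_DIR" \
        -noparse -noparsedata -noparsetext
}
```

Calling `fetch_and_dump` after sourcing the script runs the four steps in
order; the readseg dump then ends up in the DUMP_DIR directory.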

Markus, you mentioned changes to the HTML parse API between versions 1.1 and
1.2. I checked the CHANGES.txt file but couldn't find anything about this. Do
you have more information?


Best regards,
--Anders Rask
www.findwise.com

2011/7/18 Julien Nioche <lists.digitalpeb...@gmail.com>

> As pointed out by Markus the logs show that the content has been properly
> fetched. Moreover
>
>
> > ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'
>
>
> works fine. Double check your custom parser, it is likely to be the source
> of the problem.
>
> BTW: what does your custom parser do? Is it an HtmlParseFilter? If so, which
> parser are you using for HTML - parse-html or parse-tika?
>
> Julien
>
>
>
> On 18 July 2011 10:46, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Judging from the segment, those URLs are fetched and parsed. I think maybe
> > some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
> > ParserChecker shows the same issue then it's most likely a parse plugin
> > problem for the new version. Can you check?
> >
> > > Hi,
> > >
> > > If you have a look at your regex-urlfilter.txt it will by default be
> > > rejecting ? in the URL. Please test with that line edited (or commented
> > > out) and see if the problem fades.
> > >
> > > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:
> > > > Hi Markus!
> > > >
> > > > We are using a custom parser, but I don't think that the problem is in
> > > > the parsing. I got the same problem when trying the ParserChecker. I
> > > > also tried the following:
> > > >
> > > > I injected the following seeds:
> > > >
> > > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > > http://www.uu.se/
> > > >
> > > > Then generated a segment, fetched that segment and then did a readseg
> > > > with -noparse, -noparsedata and -noparsetext.
> > > >
> > > > I have attached the readseg dump and it shows no content for:
> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > >
> > > > Can the problem somehow be in the configurations for the fetcher?
> > > >
> > > >
> > > > Best regards,
> > > > --Anders Rask
> > > > www.findwise.com
> > > >
> > > >
> > > > 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
> > > >
> > > >> What parser are you using? What does bin/nutch
> > > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> > > >> fine with parse-tika enabled.
> > > >>
> > > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> > > >> > Hi!
> > > >> >
> > > >> > We are using Nutch to crawl a bunch of websites and index them to
> > > >> > Solr. At the moment we are in the process of upgrading from Nutch
> > > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
> > > >> > servers.
> > > >> >
> > > >> > Unfortunately we are stuck with a problem which we haven't seen in
> > > >> > the old environment. Several of the pages that we are fetching
> > > >> > contain no content when they are stored in the segment. The
> > > >> > following is an excerpt from "readseg" on a segment containing such
> > > >> > a page:
> > > >> >
> > > >> > ----
> > > >> >
> > > >> > Recno:: 5
> > > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > >> >
> > > >> > Content::
> > > >> > Version: -1
> > > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > >> > contentType: text/html
> > > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> > > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> > > >> > Connection=close Content-Type=text/html Server=Apache
> > > >> > Content:
> > > >> >
> > > >> > ----
> > > >> >
> > > >> > The fetch logs say nothing unusual about retrieving this page:
> > > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> > > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > >> >
> > > >> > There seems to be nothing strange about the page itself and a very
> > > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> > > >> > crawled and indexed without any problems.
> > > >> >
> > > >> > Anyone have any ideas about what might be wrong here?
> > > >> >
> > > >> >
> > > >> > Best regards,
> > > >> > --Anders Rask
> > > >> > www.findwise.com
> > > >>
> > > >> --
> > > >> Markus Jelsma - CTO - Openindex
> > > >> http://www.linkedin.com/in/markus17
> > > >> 050-8536620 / 06-50258350
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
