Re: Fetched pages has no content

Julien Nioche Wed, 20 Jul 2011 01:44:26 -0700

protocol-httpclient is broken and needs replacing

On 19 July 2011 23:10, Anders Rask <anr...@gmail.com> wrote:


> Hi guys!
>
> I experimented some more, and it seems I'm only getting these problems when
> using protocol-httpclient. It works fine when I use protocol-http.
>
> Could you please try and see if you get the same behavior?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
> 2011/7/18 Anders Rask <anr...@gmail.com>
>
> > Thank you for your quick responses!
> >
> > Our custom parser runs an embedded OpenPipeline (
> > http://www.openpipeline.org/), which in turn runs a Tika parser to parse
> > the content.
> >
> > I have tried running inject, generate, fetch, readseg with standard Nutch
> > now and it works fine with that page. So if the problem is in our custom
> > parser then how is the parser involved in the fetch command?
> >
> > Markus, you mentioned changes to the HTML parse API between versions 1.1
> > and 1.2. I checked the CHANGES.txt file but couldn't find anything about
> > this. Do you have more information?
> >
> >
> > Best regards,
> > --Anders Rask
> > www.findwise.com
> >
> >
> > 2011/7/18 Julien Nioche <lists.digitalpeb...@gmail.com>
> >
> >> As pointed out by Markus, the logs show that the content has been
> >> properly fetched. Moreover,
> >>
> >>
> >> > ./nutch org.apache.nutch.parse.ParserChecker '
> >> > http://www.uu.se/news/news_item.php?typ=pm&id=1381'
> >>
> >>
> >> works fine. Double-check your custom parser; it is likely to be the
> >> source of the problem.
> >>
> >> BTW: what does your custom parser do? Is it an HtmlParseFilter? If so,
> >> which parser are you using for HTML: parse-html or parse-tika?
> >>
> >> Julien
> >>
> >>
> >>
> >> On 18 July 2011 10:46, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>
> >> > Judging from the segment, those URLs are fetched and parsed. I think
> >> > maybe some HTML parse APIs have changed between your 1.1 and 1.2
> >> > versions. If ParserChecker shows the same issue then it's most likely
> >> > a parse plugin problem for the new version. Can you check?
> >> >
> >> > > Hi,
> >> > >
> >> > > If you have a look at your regex-urlfilter.txt, it will by default be
> >> > > rejecting ? in the URL. Please test with that line edited (or
> >> > > commented out) and see if the problem goes away.
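
The filtering behaviour suggested here can be sketched as follows. This is a
minimal illustration in Python, not Nutch's own Java code. It assumes the
stock rule -[?*!@=] followed by a catch-all accept rule, and it assumes that
the first rule whose pattern is found in the URL decides the outcome. RULES
and url_passes are hypothetical names introduced only for this sketch.

```python
import re

# Assumed rules, modelled on a stock regex-urlfilter.txt:
# '-' rejects, '+' accepts, and the first matching rule wins.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # skip URLs that look like queries
    ("+", re.compile(r".")),        # accept anything else
]


def url_passes(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is '+'."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    seeds = [
        "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
        "http://www.uu.se/",
    ]
    for seed in seeds:
        print(seed, "accepted" if url_passes(seed) else "rejected")
```

Under these assumptions a URL dropped by the filter would not be handed to
the fetcher at all, so a URL that appears in the fetch log has presumably
already passed it.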
> >> > >
> >> > > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:
> >> > > > Hi Markus!
> >> > > >
> >> > > > We are using a custom parser, but I don't think that the problem is
> >> > > > in the parsing. I got the same problem when trying the
> >> > > > ParserChecker. I also tried the following:
> >> > > >
> >> > > > I injected the following seeds:
> >> > > >
> >> > > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> >> > > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> >> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > > http://www.uu.se/
> >> > > >
> >> > > > Then generated a segment, fetched that segment and then did a
> >> > > > readseg with -noparse, -noparsedata and -noparsetext.
> >> > > >
> >> > > > I have attached the readseg dump and it shows no content for:
> >> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >
> >> > > > Can the problem somehow be in the configurations for the fetcher?
> >> > > >
> >> > > >
> >> > > > Best regards,
> >> > > > --Anders Rask
> >> > > > www.findwise.com
> >> > > >
> >> > > >
> >> > > > 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
> >> > > >
> >> > > >> What parser are you using? What does bin/nutch
> >> > > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the
> >> > > >> content fine with parse-tika enabled.
> >> > > >>
> >> > > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> >> > > >> > Hi!
> >> > > >> >
> >> > > >> > We are using Nutch to crawl a bunch of websites and index them
> >> > > >> > to Solr. At the moment we are in the process of upgrading from
> >> > > >> > Nutch 1.1 to Nutch 1.3 and at the same time going from one
> >> > > >> > server to two servers.
> >> > > >> >
> >> > > >> > Unfortunately we are stuck with a problem which we haven't seen
> >> > > >> > in the old environment. Several of the pages that we are
> >> > > >> > fetching contain no content when they are stored in the
> >> > > >> > segment. The following is an excerpt from "readseg" on a
> >> > > >> > segment containing such a page:
> >> > > >> >
> >> > > >> > ----
> >> > > >> >
> >> > > >> > Recno:: 5
> >> > > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> >
> >> > > >> > Content::
> >> > > >> > Version: -1
> >> > > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> > contentType: text/html
> >> > > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> >> > > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> >> > > >> > Connection=close Content-Type=text/html Server=Apache
> >> > > >> > Content:
> >> > > >> >
> >> > > >> > ----
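
Records like the one above can be picked out of a larger dump with a short
script. The following Python sketch is only an illustration and is not part
of Nutch. It assumes the layout shown in the excerpt: section headers end in
a double colon (Recno::, URL::, Content::), and the raw page body follows a
line starting with "Content:" (single colon). The function name
urls_with_empty_content is hypothetical.

```python
import re

# Assumed section headers, e.g. "Recno:: 5", "URL:: ...", "Content::".
SECTION = re.compile(r"^\w+::")


def urls_with_empty_content(dump_text):
    """Return the URLs of records whose 'Content:' body is blank."""
    empty, url, body = [], None, None
    # A trailing sentinel header flushes the last record.
    for line in dump_text.splitlines() + ["Recno:: end"]:
        if SECTION.match(line):
            # Any section header closes a body that is still open.
            if body is not None:
                if url and not "".join(body).strip():
                    empty.append(url)
                body = None
            if line.startswith("URL::"):
                url = line[len("URL::"):].strip()
        elif body is None and line.startswith("Content:"):
            # Start collecting the raw body (may continue on later lines).
            body = [line[len("Content:"):]]
        elif body is not None:
            body.append(line)
    return empty
```

It could be run on the text file written by a readseg dump, for example
urls_with_empty_content(open(path).read()) with path pointing at that file.
A page body that itself contains a line starting with a word and a double
colon would be cut short by this sketch.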
> >> > > >> >
> >> > > >> > The fetch logs say nothing unusual about retrieving this page:
> >> > > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> >> > > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > > >> >
> >> > > >> > There seems to be nothing strange about the page itself, and a
> >> > > >> > very similar page
> >> > > >> > (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled
> >> > > >> > and indexed without any problems.
> >> > > >> >
> >> > > >> > Anyone have any ideas about what might be wrong here?
> >> > > >> >
> >> > > >> >
> >> > > >> > Best regards,
> >> > > >> > --Anders Rask
> >> > > >> > www.findwise.com
> >> > > >>
> >> > > >> --
> >> > > >> Markus Jelsma - CTO - Openindex
> >> > > >> http://www.linkedin.com/in/markus17
> >> > > >> 050-8536620 / 06-50258350
> >> >
> >>
> >>
> >>
> >> --
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >>
> >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
