Re: Fetched pages has no content

Anders Rask Tue, 19 Jul 2011 15:11:01 -0700

Hi guys!

I experimented some more, and it seems I'm only getting these problems when
using protocol-httpclient. It works fine when I use protocol-http.


Could you please try and see if you get the same behavior?
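
To compare the two protocol plugins, one option is to run readseg on the
segment after each fetch and list the records whose stored body is empty.
Below is a minimal sketch, assuming the text dump follows the layout of the
excerpt quoted further down ("Recno::", "URL::", "Content::", then a
"Content:" line followed by the page body). The function names are mine and
are not part of Nutch.

```python
import re
from typing import Dict, List

# Assumption: section headers in the dump look like "Recno:: 5" or "Content::".
SECTION_HEADER = re.compile(r"^[A-Za-z]+::")


def content_lengths(dump_text: str) -> Dict[str, int]:
    """Map each URL in a readseg text dump to the length of its stored body."""
    lengths: Dict[str, int] = {}
    url = None
    body: List[str] = []
    in_body = False

    for line in dump_text.splitlines():
        if line.startswith("Recno::"):
            # A new record starts: store the previous one, if any.
            if url is not None:
                lengths[url] = len("\n".join(body).strip())
            url, body, in_body = None, [], False
        elif line.startswith("URL::"):
            url = line[len("URL::"):].strip()
        elif SECTION_HEADER.match(line):
            # Any other section header ends the body of the current record.
            in_body = False
        elif not in_body and line.strip() == "Content:":
            # The page body starts on the line after "Content:".
            in_body = True
        elif in_body:
            body.append(line)

    # Store the last record of the dump.
    if url is not None:
        lengths[url] = len("\n".join(body).strip())
    return lengths


def empty_urls(dump_text: str) -> List[str]:
    """Return the URLs whose stored body is empty."""
    return [u for u, n in content_lengths(dump_text).items() if n == 0]
```

Calling empty_urls() on the dump text from a protocol-httpclient run and
again on one from a protocol-http run should show whether the same URLs come
back empty in both. A body line that happens to look like a section header
would cut the body short, so treat the result as a rough check only.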


Best regards,
--Anders Rask
www.findwise.com

2011/7/18 Anders Rask <anr...@gmail.com>

> Thank you for your quick responses!
>
> Our custom parser runs an embedded OpenPipeline (
> http://www.openpipeline.org/), which in turn runs a Tika parser to parse
> the content.
>
> I have tried running inject, generate, fetch, readseg with standard Nutch
> now and it works fine with that page. So if the problem is in our custom
> parser then how is the parser involved in the fetch command?
>
> Markus, you mentioned changes to HTML parse API between version 1.1 and
> 1.2. I checked the CHANGES.txt file but couldn't find anything about this,
> do you have more information?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
>
> 2011/7/18 Julien Nioche <lists.digitalpeb...@gmail.com>
>
>> As pointed out by Markus the logs show that the content has been properly
>> fetched. Moreover
>>
>>
>> > ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'
>>
>>
>> works fine. Double check your custom parser, it is likely to be the source
>> of the problem.
>>
>> BTW: what does your custom parser do? Is it a HtmlParseFilter? If so,
>> which parser are you using for HTML - parse-html or parse-tika?
>>
>> Julien
>>
>>
>>
>> On 18 July 2011 10:46, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>> > Judging from the segment, those URLs are fetched and parsed. I think
>> > maybe some HTML parse APIs have changed between your 1.1 and 1.2
>> > versions. If parserchecker shows the same issue then it's most likely a
>> > parse plugin problem for the new version. Can you check?
>> >
>> > > Hi,
>> > >
>> > > If you have a look at your regex-urlfilter.txt it will by default be
>> > > rejecting ? in the URL. Please test with that line edited (or commented
>> > > out) and see if the problem fades.
>> > >
>> > > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:
>> > > > Hi Markus!
>> > > >
>> > > > We are using a custom parser, but I don't think that the problem is
>> > > > in the parsing. I got the same problem when trying the ParserChecker.
>> > > > I also tried the following:
>> > > >
>> > > > I injected the following seeds:
>> > > >
>> > > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
>> > > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
>> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > > http://www.uu.se/
>> > > >
>> > > > Then generated a segment, fetched that segment and then did a
>> > > > readseg with -noparse, -noparsedata and -noparsetext.
>> > > >
>> > > > I have attached the readseg dump and it shows no content for:
>> > > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >
>> > > > Can the problem somehow be in the configurations for the fetcher?
>> > > >
>> > > >
>> > > > Best regards,
>> > > > --Anders Rask
>> > > > www.findwise.com
>> > > >
>> > > >
>> > > > 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
>> > > >
>> > > >> What parser are you using? What does bin/nutch
>> > > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the
>> > > >> content fine with parse-tika enabled.
>> > > >>
>> > > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
>> > > >> > Hi!
>> > > >> >
>> > > >> > We are using Nutch to crawl a bunch of websites and index them to
>> > > >> > Solr. At the moment we are in the process of upgrading from Nutch
>> > > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
>> > > >> > servers.
>> > > >> >
>> > > >> > Unfortunately we are stuck with a problem which we haven't seen in
>> > > >> > the old environment. Several of the pages that we are fetching
>> > > >> > contain no content when they are stored in the segment. The
>> > > >> > following is an excerpt from "readseg" on a segment containing such
>> > > >> > a page:
>> > > >> >
>> > > >> > ----
>> > > >> >
>> > > >> > Recno:: 5
>> > > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> >
>> > > >> > Content::
>> > > >> > Version: -1
>> > > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> > contentType: text/html
>> > > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
>> > > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
>> > > >> > Connection=close Content-Type=text/html Server=Apache
>> > > >> > Content:
>> > > >> >
>> > > >> > ----
>> > > >> >
>> > > >> > The fetch logs say nothing unusual about retrieving this page:
>> > > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
>> > > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > > >> >
>> > > >> > There seems to be nothing strange about the page itself and a very
>> > > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
>> > > >> > crawled and indexed without any problems.
>> > > >> >
>> > > >> > Anyone have any ideas about what might be wrong here?
>> > > >> >
>> > > >> >
>> > > >> > Best regards,
>> > > >> > --Anders Rask
>> > > >> > www.findwise.com
>> > > >>
>> > > >> --
>> > > >> Markus Jelsma - CTO - Openindex
>> > > >> http://www.linkedin.com/in/markus17
>> > > >> 050-8536620 / 06-50258350
>> >
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>
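
On the regex-urlfilter.txt suggestion quoted above: as far as I understand
the regex URL filter, the first rule that matches a URL decides whether it is
kept ("+") or dropped ("-"), and a URL that matches no rule is dropped. The
sketch below imitates that in Python. The three rules are what I recall from
a stock regex-urlfilter.txt and are an assumption, as is the function name;
check them against your own file.

```python
import re
from typing import List, Optional, Tuple

# Assumption: these mirror three rules from a stock regex-urlfilter.txt.
DEFAULT_RULES: List[Tuple[str, str]] = [
    ("-", r"^(file|ftp|mailto):"),  # skip file:, ftp: and mailto: URLs
    ("-", r"[?*!@=]"),              # skip probable query URLs
    ("+", r"."),                    # accept anything else
]


def filter_url(url: str, rules: List[Tuple[str, str]] = DEFAULT_RULES) -> Optional[str]:
    """Return the URL if the first matching rule accepts it, otherwise None."""
    for sign, pattern in rules:
        # The pattern may match anywhere in the URL, not only at the start.
        if re.search(pattern, url):
            return url if sign == "+" else None
    # No rule matched, so the URL is dropped.
    return None
```

With these rules every news_item.php seed would be dropped because of the
"?", while http://www.uu.se/ would pass. Since the segment shows those URLs
being fetched, a rule like this one does not look like the cause here.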
