Hi,

If you look at your regex-urlfilter.txt, you will see that by default it
rejects URLs containing ?. Please test with that line edited (or commented
out) and see whether the problem goes away.
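
To make the suggestion concrete: as far as I recall, the stock conf/regex-urlfilter.txt contains the rule -[?*!@=] followed by a catch-all +. rule, and the first rule that matches decides. The sketch below only emulates that first-match behaviour in Python so you can check URLs by hand; it is not Nutch code. The rule lists and the accepts() helper are illustrative assumptions, so compare them against your own file.

```python
import re

# Rules as (sign, pattern) pairs, mirroring lines in conf/regex-urlfilter.txt.
# '-' rejects, '+' accepts; the first rule whose pattern is found in the URL wins.
# Assumption: these two rules stand in for the stock file; yours may differ.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # skip URLs containing probable query characters
    ("+", r"."),        # accept anything else
]

# The same rules with the query-character line commented out (i.e. removed).
EDITED_RULES = [
    ("+", r"."),
]


def accepts(url, rules):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is dropped.
    return False


if __name__ == "__main__":
    urls = [
        "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
        "http://www.uu.se/",
    ]
    for url in urls:
        print(url, accepts(url, DEFAULT_RULES), accepts(url, EDITED_RULES))
```

If the rule is active in your setup, a check like this should report every news_item.php?... URL as rejected, not only the one with the empty content, so it is also worth confirming which filter file your crawl actually loads.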

On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <anr...@gmail.com> wrote:

> Hi Markus!
>
> We are using a custom parser, but I don't think that the problem is in the
> parsing. I got the same problem when trying the ParserChecker. I also tried
> the following:
>
> I injected the following seeds:
>
> http://www.uu.se/news/news_item.php?id=1423&typ=pm
> http://www.uu.se/news/news_item.php?id=1421&typ=pm
> http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> http://www.uu.se/news/news_item.php?id=1407&typ=pm
> http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
> http://www.uu.se/
>
> Then generated a segment, fetched that segment and then did a readseg with
> -noparse, -noparsedata and -noparsetext.
>
> I have attached the readseg dump and it shows no content for:
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
>
> Can the problem somehow be in the configurations for the fetcher?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
>
> 2011/7/15 Markus Jelsma <markus.jel...@openindex.io>
>
>> What parser are you using? What does bin/nutch
>> org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine
>> with parse-tika enabled.
>>
>> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
>> > Hi!
>> >
>> > We are using Nutch to crawl a bunch of websites and index them to Solr. At
>> > the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
>> > and at the same time going from one server to two servers.
>> >
>> > Unfortunately we are stuck with a problem which we haven't seen in the old
>> > environment. Several of the pages that we are fetching contain no content
>> > when they are stored in the segment. The following is an excerpt from
>> > "readseg" on a segment containing such a page:
>> >
>> > ----
>> >
>> > Recno:: 5
>> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> >
>> > Content::
>> > Version: -1
>> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> > contentType: text/html
>> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
>> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
>> > Connection=close Content-Type=text/html Server=Apache
>> > Content:
>> >
>> > ----
>> >
>> > The fetch logs say nothing unusual about retrieving this page:
>> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
>> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
>> >
>> > There seems to be nothing strange about the page itself and a very similar
>> > page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
>> > indexed without any problems.
>> >
>> > Anyone have any ideas about what might be wrong here?
>> >
>> >
>> > Best regards,
>> > --Anders Rask
>> > www.findwise.com
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
>>
>
>


-- 
*Lewis*
