Re:Re: New script bin/crawl - skipping urls different batch id (XXXXXXXX-YYYYYYYYY)

RS Thu, 04 Jul 2013 20:47:30 -0700

Hi,
    The rules in your conf/regex-urlfilter.txt file are wrong.
    You should change them like this:
     +^http://www.eisbaeren.de/*
    Then you will get log output like this:
    fetching http://www.eisbaeren.de/club/partner/ (queue crawl delay=5000ms)
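
To see why a rule lets a URL through or not, here is a rough Python sketch of how I understand the filter to work. It is not Nutch code. It assumes that rules are tried in file order, that the first pattern found in the URL decides, and that a URL matching no rule is rejected. The helper names load_rules and accepts are made up for this example.

```python
import re

def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The rule suggested in this reply.
NEW_RULES = load_rules(r"""
+^http://www.eisbaeren.de/*
""")

# The rules as quoted in the original message.
OLD_RULES = load_rules(r"""
+^http://([a-z0-9]*\.)*funkhauseuropa.de/
+^http://([a-z0-9]*\.)*swr.de/
+^http://([a-z0-9]*\.)*swrmediathek.de/
""")

url = "http://www.eisbaeren.de/club/partner/"
print(accepts(NEW_RULES, url))  # True
print(accepts(OLD_RULES, url))  # False
```

Under these assumptions, the quoted rules accept only the funkhauseuropa.de, swr.de and swrmediathek.de hosts, so outlinks on www.eisbaeren.de are dropped until a rule for that host is added.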



Thanks 
HeChuan




At 2013-07-05 03:32:36,glumet <[email protected]> wrote:
>OK, as I wrote, the problem was in an old version of Nutch (2.1).
>After updating to 2.2.1, the message about the different batch id
>disappeared, but I have a new problem now.
>
>Every time I start the script bin/crawl, it fetches only the urls from the
>seed (no other pages)
>
>fetching http://www.museumhetvalkhof.nl/ (queue crawl delay=5000ms)
>fetching http://www.eisbaeren.de/ (queue crawl delay=5000ms)
>fetching http://www.s-bahn-berlin.de/ (queue crawl delay=5000ms)
>
>...but I also want to fetch and then parse
>
>fetching http://www.museumhetvalkhof.nl/something.html
>fetching http://www.eisbaeren.de/something/something.html
>
>etc...
>
>Where is the problem please?
>
>The urls in my seed are defined like:
>
>http://www.funkhauseuropa.de/
>http://www.swr.de/
>http://www.swrmediathek.de/
>
>And regex-urlfilter.txt:
>
>+^http://([a-z0-9]*\.)*funkhauseuropa.de/
>+^http://([a-z0-9]*\.)*swr.de/
>+^http://([a-z0-9]*\.)*swrmediathek.de/
>
>
>
>
>--
>View this message in context: 
>http://lucene.472066.n3.nabble.com/New-script-bin-crawl-skipping-urls-different-batch-id-XXXXXXXX-YYYYYYYYY-tp4075441p4075577.html
>Sent from the Nutch - User mailing list archive at Nabble.com.
