OK, as I wrote earlier, the problem was caused by an old version of Nutch (2.1).
After updating to 2.2.1 the message about a different batch id is gone, but I
have a new problem now.

Every time I start the bin/crawl script, it fetches only the URLs from the seed
list (no further pages):

fetching http://www.museumhetvalkhof.nl/ (queue crawl delay=5000ms)
fetching http://www.eisbaeren.de/ (queue crawl delay=5000ms)
fetching http://www.s-bahn-berlin.de/ (queue crawl delay=5000ms)

...but I also want it to fetch and then parse deeper pages such as

fetching http://www.museumhetvalkhof.nl/something.html
fetching http://www.eisbaeren.de/something/something.html

etc...
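
For reference, I start the script roughly like this (the crawl id, the Solr URL
and the number of rounds below are placeholders, not my exact values):

bin/crawl urls/ myCrawl http://localhost:8983/solr/ 3

As far as I understand the 2.2.1 script, the last argument is the number of
generate/fetch/parse/updatedb rounds, so outlinks found in round 1 should be
fetched in round 2.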

Where is the problem, please?

The URLs in my seed file are defined like this:

http://www.funkhauseuropa.de/
http://www.swr.de/
http://www.swrmediathek.de/

And regex-urlfilter.txt contains:

+^http://([a-z0-9]*\.)*funkhauseuropa.de/
+^http://([a-z0-9]*\.)*swr.de/
+^http://([a-z0-9]*\.)*swrmediathek.de/
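
To rule out the URL filters, one can pipe a deeper URL through Nutch's filter
checker (the example URL below is made up, and I am assuming the checker class
is included in the 2.2.1 build):

echo "http://www.swr.de/something/something.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A leading + in the output means the combined filters accept the URL, a leading
- means they reject it.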



