Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-21 Thread joshua paul
YES - I forgot to include that... robots.txt is fine. It is wide open:

###
# sample robots.txt file for this website
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
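A quick local sanity check (a sketch, not part of the thread): a robots.txt is only "wide open" if no Disallow line is active, i.e. every Disallow is commented out. The file path below is an arbitrary temp location.

```shell
# Recreate the sample robots.txt from the message above.
cat > /tmp/robots.txt <<'EOF'
# sample robots.txt file for this website
# addresses all robots by using wild card *
User-agent: *
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
EOF

# An active rule would start at column 0 with "Disallow:";
# here the only Disallow is commented out, so nothing is blocked.
if grep -q '^Disallow:' /tmp/robots.txt; then
  echo "blocked paths present"
else
  echo "wide open"
fi
```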

nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com.

I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- the urls directory contains urls.txt, which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains

RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Arkadi.Kosmynin
What is in your regex-urlfilter.txt?

-Original Message-
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL filters when trying to index
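One way to answer Arkadi's question is to list only the active rules in the filter file, skipping comments and blank lines. This is a sketch: the file content below is a stand-in (the `-[?*!@=]` line is the stock Nutch default, assumed here since the archive elides it), and the path is a temp location rather than the real conf directory.

```shell
# Stand-in regex-urlfilter.txt with one exclusion rule and one
# catch-all accept rule, plus comments.
cat > /tmp/regex-urlfilter.txt <<'EOF'
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept everything else
+.
EOF

# Show only the rules Nutch will actually apply:
# drop comment lines and blank lines.
grep -v '^#' /tmp/regex-urlfilter.txt | grep -v '^$'
```

Rules are applied top-down and the first match wins, so an exclusion rule above the accept rule can silently reject every seed URL.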

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
After getting this email, I tried commenting out this line in regex-urlfilter.txt:

#-[...@=]

but it didn't help... I still get the same message - no URLs to fetch.

regex-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited
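The archive has elided part of that character class; in stock Nutch distributions the line under this exact comment is typically `-[?*!@=]` (an assumption here, not confirmed by the thread). The sketch below shows what such a rule does: any URL containing one of those characters is rejected as a probable dynamic/query URL, while a plain seed URL passes.

```shell
# Simulate the '-' character-class rule with grep. The class
# '?*!@=' is the assumed stock Nutch default, not taken verbatim
# from this thread (the archive elided it).
for url in 'http://www.fmforums.com/' \
           'http://www.fmforums.com/index.php?showtopic=1'; do
  if echo "$url" | grep -q '[?*!@=]'; then
    echo "excluded: $url"   # matches the class -> rejected
  else
    echo "kept: $url"       # no query characters -> passes
  fi
done
```

Since `http://www.fmforums.com/` contains none of those characters, this rule alone would not reject the seed, which is consistent with commenting it out making no difference.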

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Harry Nutch
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul jos...@neocodesoftware.com wrote:

after getting this email, I tried commenting out this line in regex-urlfilter.txt = #-[...@=] but it didn't help... i still get same message - no urls to fetch regex-urlfilter.txt = #

nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-16 Thread joshuasottpaul
bin/nutch crawl urls -dir crawl -depth 3 -topN 50

where the urls directory contains urls.txt, which contains http://www.fmforums.com/

where crawl-urlfilter.txt contains:

+^http://([a-z0-9]*\.)*fmforums.com/

note - my nutch setup indexes other sites fine. for example where urls directory contains
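As a quick offline check (a sketch; Nutch uses Java regexes, which `grep -E` only approximates for a pattern this simple), the `+` accept rule above does match the seed URL, so the crawl-urlfilter line itself is not what rejects it:

```shell
# Test the accept rule from crawl-urlfilter.txt against the seed URL.
# The leading '+' marks an accept rule in Nutch and is not part of the
# regex itself.
pattern='^http://([a-z0-9]*\.)*fmforums.com/'

for url in 'http://www.fmforums.com/' 'http://www.example.com/'; do
  if echo "$url" | grep -Eq "$pattern"; then
    echo "accepted: $url"
  else
    echo "rejected: $url"
  fi
done
```

If the seed matches the accept rule and robots.txt is open, the remaining suspects are the other filter files on the plugin chain (e.g. regex-urlfilter.txt, which applies in addition to crawl-urlfilter.txt depending on how the crawl is launched).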