Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
YES - I forgot to include that... robots.txt is fine. It is wide open:

    ###
    # sample robots.txt file for this website
    # addresses all robots by using wild card *
    User-agent: *
    #
    # list folders robots are not allowed to index
    #Disallow: /tutorials/404redirect/
    Disallow:
    #
    # list specific files robots are not allowed to index
    #Disallow: /tutorials/custom_error_page.html
    Disallow:
    #
    # list the location of any sitemaps
    Sitemap: http://www.yourdomain.com/site_index.xml
    #
    # End of robots.txt file
    ###

Harry Nutch wrote on 2010-04-20 19:22:
> Did you check robots.txt?

(rest of the quoted thread trimmed - the earlier messages appear in full below)

--
catching falling stars...
https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA
www.neocodesoftware.com store.neocodesoftware.com www.monicapark.ca www.digitalpostercenter.com
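For what it's worth, a robots.txt like this can be checked mechanically rather than by eye. Here is a minimal standalone sketch (not part of nutch) using Python's urllib.robotparser, with a condensed copy of the file pasted inline instead of fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# Condensed copy of the wide-open robots.txt above: an empty
# "Disallow:" line permits everything for the matched user agent.
robots_txt = """\
User-agent: *
Disallow:
Sitemap: http://www.yourdomain.com/site_index.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any agent, nutch included, may fetch the seed URL under these rules.
print(rp.can_fetch("Nutch", "http://www.fmforums.com/"))  # True
```

Since the empty `Disallow:` allows everything, robots.txt is indeed not the thing blocking the crawl here.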
RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
What is in your regex-urlfilter.txt?

-----Original Message-----
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

nutch says "No URLs to fetch - check your seed list and URL filters" when trying to index fmforums.com.

I am using this command:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- the urls directory contains urls.txt, which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/

Note - my nutch setup indexes other sites fine. For example, I am using this command:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- the urls directory contains urls.txt, which contains http://dispatch.neocodesoftware.com
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

and nutch generates a good crawl. How can I troubleshoot why nutch says "No URLs to fetch"?
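One way to rule out the accept pattern itself is to test it outside nutch. A quick standalone check with Python's re module (nutch's RegexURLFilter uses Java regexes, but this particular pattern behaves the same in both):

```python
import re

# Accept rule from crawl-urlfilter.txt, minus the leading '+' marker,
# which is nutch's accept/reject flag rather than part of the regex.
accept = re.compile(r"^http://([a-z0-9]*\.)*fmforums.com/")

print(bool(accept.search("http://www.fmforums.com/")))              # True
print(bool(accept.search("http://dispatch.neocodesoftware.com/")))  # False - different host
```

The seed URL does match the accept rule, so the pattern itself is probably not what is rejecting it; an earlier reject rule, or something outside the filters entirely, is a more likely culprit.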
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
After getting this email, I tried commenting out this line in regex-urlfilter.txt:

    #-[?*!@=]

but it didn't help... I still get the same message - no URLs to fetch.

regex-urlfilter.txt:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    # accept anything else
    +.

crawl-urlfilter.txt:

    # skip URLs containing certain characters as probable queries, etc.
    # we don't want to skip
    #-[?*!@=]
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    +^http://([a-z0-9]*\.)*fmforums.com/
    # skip everything else
    -.

arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
> What is in your regex-urlfilter.txt?

(rest of the quoted original message trimmed - it appears in full above)

--
catching falling stars...
https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA
www.neocodesoftware.com store.neocodesoftware.com www.monicapark.ca www.digitalpostercenter.com
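The order of these rules matters: nutch's regex URL filter walks the list top to bottom, and the first rule whose pattern matches decides the URL's fate. A toy re-implementation of that first-match-wins semantics (an illustration under that assumption, not nutch's actual RegexURLFilter code) makes it easy to see which rule fires for a given URL:

```python
import re

# Sketch of nutch's regex-urlfilter semantics: first matching rule wins.
# Rules copied from the regex-urlfilter.txt shown above.
RULES = [
    ("-", r"[?*!@=]"),                      # skip probable query URLs
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),  # skip 3+ repeated path segments
    ("+", r"."),                            # accept anything else
]

def accepts(url):
    for sign, regex in RULES:
        if re.search(regex, url):
            return sign == "+"
    return False  # no rule matched: reject by default

print(accepts("http://www.fmforums.com/"))                     # True
print(accepts("http://www.fmforums.com/index.php?showtopic=1"))  # False: '?' and '=' hit rule 1
```

Note how the forum front page passes but a typical forum deep link with a query string is killed by the first rule - worth keeping in mind for a site like fmforums.com whose topic pages may live behind `?`/`=` URLs, and worth checking in both filter files.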
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul <jos...@neocodesoftware.com> wrote:
> After getting this email, I tried commenting out this line in
> regex-urlfilter.txt (#-[?*!@=]) but it didn't help... I still get the
> same message - no URLs to fetch.
>
> (rest of the quoted message trimmed - it appears in full above)