Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
YES - I forgot to include that... robots.txt is fine. It is wide open:

###
#
# sample robots.txt file for this website
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disallow:
#
# list specific files robots are not allowed to index
#Disallow: /tutorials/custom_error_page.html
Disallow:
#
# list the location of any sitemaps
Sitemap: http://www.yourdomain.com/site_index.xml
#
# End of robots.txt file
#
###

Harry Nutch wrote on 2010-04-20 19:22:
> Did you check robots.txt?
>
> On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:
>> after getting this email, I tried commenting out this line in
>> regex-urlfilter.txt =
>>
>> #-[...@=]
>>
>> but it didn't help... I still get the same message - no urls to fetch
>>
>> regex-urlfilter.txt =
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[...@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept anything else
>> +.
>>
>> crawl-urlfilter.txt =
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> # we don't want to skip
>> #-[...@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> +^http://([a-z0-9]*\.)*fmforums.com/
>>
>> # skip everything else
>> -.
>>
>> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>>> What is in your regex-urlfilter.txt?
>>>
>>>> -Original Message-
>>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>>> To: nutch-user@lucene.apache.org
>>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>>> filters when trying to index fmforums.com
>>>>
>>>> nutch says No URLs to fetch - check your seed list and URL filters when
>>>> trying to index fmforums.com.
>>>>
>>>> I am using this command:
>>>>
>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>
>>>> - urls directory contains urls.txt which contains http://www.fmforums.com/
>>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>>>
>>>> Note - my nutch setup indexes other sites fine.
>>>>
>>>> For example I am using this command:
>>>>
>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>>
>>>> - urls directory contains urls.txt which contains http://dispatch.neocodesoftware.com
>>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>>>
>>>> And nutch generates a good crawl.
>>>>
>>>> How can I troubleshoot why nutch says "No URLs to fetch"?

--
catching falling stars...

https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware
Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA

www.neocodesoftware.com store.neocodesoftware.com
www.monicapark.ca www.digitalpostercenter.com
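If robots.txt really is open, the next thing worth ruling out is the URL filter chain itself. Nutch 1.x includes a URLFilterChecker class that runs a URL through the configured filter plugins and prints a leading "+" (accepted) or "-" (rejected). The lines below are only a sketch of one way to invoke it, assuming a stock 1.x install where the class is on the classpath and reads URLs from standard input:

# Sketch (assumes Nutch 1.x): run the seed through every configured URL filter.
# Output should be "+http://www.fmforums.com/" if the filters accept it.
echo "http://www.fmforums.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

One caveat: in releases of this era the one-step crawl command tends to pick up crawl-urlfilter.txt (via crawl-tool.xml) while other tools read regex-urlfilter.txt, so if the checker accepts the URL but the crawl still drops it, compare the two files.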
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:
> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt =
>
> #-[...@=]
>
> but it didn't help... I still get the same message - no urls to fetch
>
> regex-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
> crawl-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
> # we don't want to skip
> #-[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> +^http://([a-z0-9]*\.)*fmforums.com/
>
> # skip everything else
> -.
>
> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>> What is in your regex-urlfilter.txt?
>>
>>> -Original Message-
>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>> filters when trying to index fmforums.com
>>>
>>> nutch says No URLs to fetch - check your seed list and URL filters when
>>> trying to index fmforums.com.
>>>
>>> I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt which contains http://www.fmforums.com/
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>>
>>> Note - my nutch setup indexes other sites fine.
>>>
>>> For example I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt which contains http://dispatch.neocodesoftware.com
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>>
>>> And nutch generates a good crawl.
>>>
>>> How can I troubleshoot why nutch says "No URLs to fetch"?
>
> --
> catching falling stars...
>
> https://www.linkedin.com/in/joshuascottpaul
> MSN coga...@hotmail.com AOL neocodesoftware
> Yahoo joshuascottpaul Skype neocodesoftware
> Toll Free 1.888.748.0668 Fax 1-866-336-7246
> #238 - 425 Carrall St YVR BC V6B 6E3 CANADA
>
> www.neocodesoftware.com store.neocodesoftware.com
> www.monicapark.ca www.digitalpostercenter.com
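Fetching robots.txt by hand is a quick way to confirm what the crawler will actually see, and whether any rule targets the agent name Nutch announces (the http.agent.name / http.robots.agents settings). A minimal check with curl, assuming the file is served at the standard path:

# Dump the robots.txt the fetcher would download.
curl -s http://www.fmforums.com/robots.txt

# Show only the Disallow rules; an empty "Disallow:" under "User-agent: *"
# allows everything, while "Disallow: /" would explain dropped URLs.
curl -s http://www.fmforums.com/robots.txt | grep -i disallow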
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
after getting this email, I tried commenting out this line in regex-urlfilter.txt =

#-[...@=]

but it didn't help... I still get the same message - no urls to fetch

regex-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

crawl-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
# we don't want to skip
#-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

+^http://([a-z0-9]*\.)*fmforums.com/

# skip everything else
-.

arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
> What is in your regex-urlfilter.txt?
>
>> -Original Message-
>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>> Sent: Wednesday, 21 April 2010 9:44 AM
>> To: nutch-user@lucene.apache.org
>> Subject: nutch says No URLs to fetch - check your seed list and URL
>> filters when trying to index fmforums.com
>>
>> nutch says No URLs to fetch - check your seed list and URL filters when
>> trying to index fmforums.com.
>>
>> I am using this command:
>>
>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>
>> - urls directory contains urls.txt which contains http://www.fmforums.com/
>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>
>> Note - my nutch setup indexes other sites fine.
>>
>> For example I am using this command:
>>
>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>
>> - urls directory contains urls.txt which contains http://dispatch.neocodesoftware.com
>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>
>> And nutch generates a good crawl.
>>
>> How can I troubleshoot why nutch says "No URLs to fetch"?

--
catching falling stars...

https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware
Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA

www.neocodesoftware.com store.neocodesoftware.com
www.monicapark.ca www.digitalpostercenter.com
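As a quick sanity check on the accept rule itself, the seed can be matched against the pattern outside of Nutch; for an expression this simple, grep -E behaves closely enough to Java's regex to be useful:

# Prints the URL if the accept pattern matches it, nothing otherwise.
echo "http://www.fmforums.com/" | grep -E '^http://([a-z0-9]*\.)*fmforums\.com/'

It should match ("www." satisfies the ([a-z0-9]*\.)* group), which would mean the accept line is fine and the URL is being rejected somewhere else: an earlier "-" rule, the other filter file, or something outside the filters entirely.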
RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
What is in your regex-urlfilter.txt?

> -Original Message-
> From: joshua paul [mailto:jos...@neocodesoftware.com]
> Sent: Wednesday, 21 April 2010 9:44 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch says No URLs to fetch - check your seed list and URL
> filters when trying to index fmforums.com
>
> nutch says No URLs to fetch - check your seed list and URL filters when
> trying to index fmforums.com.
>
> I am using this command:
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> - urls directory contains urls.txt which contains http://www.fmforums.com/
> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>
> Note - my nutch setup indexes other sites fine.
>
> For example I am using this command:
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> - urls directory contains urls.txt which contains http://dispatch.neocodesoftware.com
> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>
> And nutch generates a good crawl.
>
> How can I troubleshoot why nutch says "No URLs to fetch"?
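Two generic ways to see where the seed is being dropped, assuming the default layout the crawl command above produces (crawl data under crawl/ and the run log in logs/hadoop.log):

# How many URLs made it into the crawldb, and in what states?
bin/nutch readdb crawl/crawldb -stats

# Look for messages about robots denials or filter rejections in the log.
grep -iE "robots|denied|filter" logs/hadoop.log | tail -n 50

If the stats show no URLs at all, injection filtered out the seed; if the URL is present but never fetched, the log usually says why (a robots.txt denial, a fetch error, or a rejection during generate).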