Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-21 Thread joshua paul

YES - I forgot to include that... robots.txt is fine. It is wide open:

###
#
# sample robots.txt file for this website
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disallow:
#
# list specific files robots are not allowed to index
#Disallow: /tutorials/custom_error_page.html
Disallow:
#
# list the location of any sitemaps
Sitemap: http://www.yourdomain.com/site_index.xml
#
# End of robots.txt file
#
###
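(For anyone retracing this later: the block above is the template as deployed. A plain fetch shows exactly what a crawler is served; this assumes curl or wget is installed on the crawl machine, which the thread doesn't say.)

# fetch the live robots.txt exactly as a crawler would receive it
curl -s http://www.fmforums.com/robots.txt
# or, without curl:
wget -qO- http://www.fmforums.com/robots.txt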



Harry Nutch wrote on 2010-04-20 19:22:

Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:

   

after getting this email, I tried commenting out this line in
regex-urlfilter.txt =

#-[...@=]

but it didn't help... I still get the same message - no URLs to fetch


regex-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
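Since the file ends with +. (accept anything else), the seed should pass unless one of the earlier rules rejects it. A quick way to confirm, as a hedged sketch assuming the Nutch build in use ships the org.apache.nutch.net.URLFilterChecker helper class (it reads URLs from stdin and prints + or - for each one):

# run the seed through the combined, configured URL filters
echo "http://www.fmforums.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined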

crawl-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
# we don't want to skip
#-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/


+^http://([a-z0-9]*\.)*fmforums.com/

# skip everything else
-.
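To sanity-check the accept pattern on its own, independent of Nutch, something like the following works as a rough approximation (grep's extended regexes are not identical to Java's, so treat the result only as a hint):

# test the seed URL against the accept rule from crawl-urlfilter.txt
echo "http://www.fmforums.com/" | grep -qE '^http://([a-z0-9]*\.)*fmforums.com/' \
  && echo accepted || echo rejected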


arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:

  What is in your regex-urlfilter.txt?
 



   

-Original Message-
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL
filters when trying to index fmforums.com

nutch says No URLs to fetch - check your seed list and URL filters when
trying to index fmforums.com.

I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- urls directory contains urls.txt which contains
http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/

Note - my nutch setup indexes other sites fine.

For example I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- urls directory contains urls.txt which contains
http://dispatch.neocodesoftware.com
- crawl-urlfilter.txt contains
+^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

And nutch generates a good crawl.

How can I troubleshoot why nutch says "No URLs to fetch"?
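One place to start, assuming the crawl directory from the command above was actually created: dump the crawldb statistics and see whether the seed was injected at all. Zero entries points at the seed list or the URL filters; entries stuck at db_unfetched point at the generate/fetch stage instead.

# show injected/fetched counts for the crawl started with "-dir crawl"
bin/nutch readdb crawl/crawldb -stats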


 
   



--
catching falling stars...

https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware
Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA

www.neocodesoftware.com store.neocodesoftware.com
www.monicapark.ca www.digitalpostercenter.com



Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Harry Nutch
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:

> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt =
>
> #-[...@=]
>
> but it didn't help... I still get the same message - no URLs to fetch
>
>
> regex-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
> crawl-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
> # we don't want to skip
> #-[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
>
> +^http://([a-z0-9]*\.)*fmforums.com/
>
> # skip everything else
> -.
>
>
> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>
>  What is in your regex-urlfilter.txt?
>>
>>
>>
>>> -Original Message-
>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>> filters when trying to index fmforums.com
>>>
>>> nutch says No URLs to fetch - check your seed list and URL filters when
>>> trying to index fmforums.com.
>>>
>>> I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt which contains
>>> http://www.fmforums.com/
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>>
>>> Note - my nutch setup indexes other sites fine.
>>>
>>> For example I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt which contains
>>> http://dispatch.neocodesoftware.com
>>> - crawl-urlfilter.txt contains
>>> +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>>
>>> And nutch generates a good crawl.
>>>
>>> How can I troubleshoot why nutch says "No URLs to fetch"?


Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul

after getting this email, I tried commenting out this line in 
regex-urlfilter.txt =

#-[...@=]

but it didn't help... I still get the same message - no URLs to fetch


regex-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

crawl-urlfilter.txt =

# skip URLs containing certain characters as probable queries, etc.
# we don't want to skip
#-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

+^http://([a-z0-9]*\.)*fmforums.com/

# skip everything else
-.


arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:

What is in your regex-urlfilter.txt?

  

-Original Message-
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL
filters when trying to index fmforums.com

nutch says No URLs to fetch - check your seed list and URL filters when
trying to index fmforums.com.

I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- urls directory contains urls.txt which contains
http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/

Note - my nutch setup indexes other sites fine.

For example I am using this command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- urls directory contains urls.txt which contains
http://dispatch.neocodesoftware.com
- crawl-urlfilter.txt contains
+^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

And nutch generates a good crawl.

How can I troubleshoot why nutch says "No URLs to fetch"?



--
catching falling stars...

https://www.linkedin.com/in/joshuascottpaul
MSN coga...@hotmail.com AOL neocodesoftware
Yahoo joshuascottpaul Skype neocodesoftware
Toll Free 1.888.748.0668 Fax 1-866-336-7246
#238 - 425 Carrall St YVR BC V6B 6E3 CANADA

www.neocodesoftware.com store.neocodesoftware.com
www.monicapark.ca www.digitalpostercenter.com



RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Arkadi.Kosmynin
What is in your regex-urlfilter.txt?

> -Original Message-
> From: joshua paul [mailto:jos...@neocodesoftware.com]
> Sent: Wednesday, 21 April 2010 9:44 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch says No URLs to fetch - check your seed list and URL
> filters when trying to index fmforums.com
> 
> nutch says No URLs to fetch - check your seed list and URL filters when
> trying to index fmforums.com.
> 
> I am using this command:
> 
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> 
> - urls directory contains urls.txt which contains
> http://www.fmforums.com/
> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
> 
> Note - my nutch setup indexes other sites fine.
> 
> For example I am using this command:
> 
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> 
> - urls directory contains urls.txt which contains
> http://dispatch.neocodesoftware.com
> - crawl-urlfilter.txt contains
> +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
> 
> And nutch generates a good crawl.
> 
> How can I troubleshoot why nutch says "No URLs to fetch"?