Did you set db.ignore.external.links in *conf/nutch-site.xml*?
This prevents external links from being fetched.
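If that property is set to true, all outlinks pointing to other hosts are dropped. A minimal sketch of the override in nutch-site.xml (the property name is the stock one; the description text is paraphrased):

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks that point to an external host are ignored.</description>
  </property>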
Another possible problem is that the robots.txt of those servers prevents
the crawler from fetching.
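You can check the rules by hand, for example with curl against one of the hosts from your fetch log (just an illustration, any of the hosts works):

  curl http://www.tarpits.org/robots.txt
  curl http://az.wikipedia.org/robots.txt

Look for Disallow rules that match the paths you are trying to fetch.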
You can check this with *bin/nutch readdb*. There you can see whether the
sites were really fetched.
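For example (a sketch; the exact readdb options differ between Nutch 1.x and 2.x, so run bin/nutch readdb without arguments to see the usage for your version):

  bin/nutch readdb -stats
  bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

If a URL shows up with a fetched status but is still missing from Solr, the problem is in the indexing step rather than in fetching.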
regards
Stefan
On 18.08.2012 09:07, Robert Irribarren wrote:
I run this:
nutch inject urls
nutch generate
bin/nutch crawl urls -depth 3 -topN 100
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
echo Crawling completed
dir
Then I see a lot of URLs being fetched during the crawl phase.
When I run solrindex, it doesn't add all the URLs I see when it says
fetching:
54 URLs in 5 queues
fetching http://www.tarpits.org/join-us
fetching http://www.leonisadobemuseum.org/history-leonis.asp
fetching http://az.wikipedia.org/wiki/Quercus_prinus
It doesn't add the Wikipedia page or the others.
ADDITIONAL INFO:
My regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
#################################################################
ADDITIONAL INFO: Running Solr 4.0 and Nutch 2.0