Hi all.
I am trying to restrict Nutch to crawling image documents only. I have used 
suffix-urlfilter.txt to exclude some extensions I do not need, and 
regex-urlfilter.txt to allow image documents, but Nutch does not generate any 
URLs to fetch. Any suggestion on how to configure Nutch to crawl image 
documents only would be appreciated.
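
For illustration, this is roughly the kind of filtering I am trying to set up 
(the extension lists below are only an example, not my exact files). In 
regex-urlfilter.txt, a '+' line accepts matching URLs and a '-' line rejects 
them:

# accept only URLs ending in common image extensions
+\.(gif|GIF|jpg|JPG|jpeg|JPEG|png|PNG|bmp|BMP)$
# reject everything else
-.

and in suffix-urlfilter.txt (where, as far as I understand, the listed 
suffixes are rejected by default) I list the extensions I do not want, 
something like:

.html
.htm
.php
.css
.js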
I am using Nutch 1.4 and Solr 3.6 in local (single-node) mode, running:
bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr http://localhost:8080/solr/images

My seed.txt has 19 URLs, and this is my console output:

crawl started in: crawl
rootUrlDir = urls
threads = 20
depth = 10
solrUrl=http://localhost:8080/solr/images
topN = 1000
Injector: starting at 2013-03-06 10:41:33
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
Generator: starting at 2013-03-06 10:41:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

