Hi all. I am trying to restrict Nutch to crawling image documents only. I have used suffix-urlfilter.txt to exclude extensions I do not need, and regex-urlfilter.txt to allow image documents, but Nutch does not generate any URLs to fetch. Any suggestion on how to configure Nutch to crawl image documents only would be appreciated. I am using Nutch 1.4 and Solr 3.6 in local (single-node) mode with:

  bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr http://localhost:8080/solr/images
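
For context, this is roughly the rule set I am aiming for in regex-urlfilter.txt (a sketch of the intent rather than my exact file; as far as I can tell, the default conf/regex-urlfilter.txt that ships with Nutch includes a rule that skips image suffixes, so that rule has to be removed or inverted):

  # skip file:, ftp: and mailto: urls
  -^(file|ftp|mailto):
  # skip URLs containing characters that are probably queries
  -[?*!@=]
  # accept common image extensions (upper-case variants omitted here)
  +\.(gif|jpg|jpeg|png|bmp|ico)$
  # reject everything else
  -.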
My seed.txt has 19 URLs, and this is my console output:

  crawl started in: crawl
  rootUrlDir = urls
  threads = 20
  depth = 10
  solrUrl=http://localhost:8080/solr/images
  topN = 1000
  Injector: starting at 2013-03-06 10:41:33
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
  Generator: starting at 2013-03-06 10:41:36
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 1000
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: 0 records selected for fetching, exiting ...
  Stopping at depth=0 - no more URLs to fetch.
  No URLs to fetch - check your seed list and URL filters.
  crawl finished: crawl
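
Since the generator selects 0 records, I suspect my filters are rejecting the seed URLs themselves. If I understand the URLFilterChecker tool correctly, something along these lines should show whether each seed passes the combined filters (this invocation is my assumption; I believe accepted URLs are printed with a leading + and rejected ones with a leading -):

  # feed the seed list through all configured URL filters
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt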

