Can you provide your regex-urlfilter and suffix-urlfilter files?
On Wed, Mar 6, 2013 at 7:58 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:
> Hi all.
> I am trying to restrict Nutch to crawl image documents only. I have used a
> suffix-urlfilter.txt to exclude some extensions I don't need, and also a
> regex-urlfilter.txt to allow image documents, but Nutch doesn't generate
> any URLs to fetch. Any suggestion on configuring Nutch to crawl image
> documents only would be appreciated.
> I am using Nutch 1.4 and Solr 3.6 in single mode with:
> bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr http://localhost:8080/solr/images
>
> My seed.txt has 19 URLs, and this is my console output:
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 20
> depth = 10
> solrUrl=http://localhost:8080/solr/images
> topN = 1000
> Injector: starting at 2013-03-06 10:41:33
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
> Generator: starting at 2013-03-06 10:41:36
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
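In the meantime, one common cause of "Generator: 0 records selected for fetching" is that the filters accept only image extensions, which also rejects the seed pages themselves (your log shows "Generator: filtering: true", so the filters are applied when selecting URLs to fetch). The crawler then never fetches the HTML pages that link to the images. A sketch of a regex-urlfilter.txt that avoids this is below; the extension lists are assumptions, not your actual files, and the image-only restriction would instead be applied at indexing time:

```
# regex-urlfilter.txt sketch (assumed content, not the poster's file)

# skip file:, ftp:, and mailto: schemes
-^(file|ftp|mailto):

# skip non-image formats you don't want fetched (example list)
-\.(css|js|pdf|zip|gz|exe|mov|mp3)$

# accept everything else: the HTML pages (needed to discover links
# to images) as well as the images themselves. Accepting ONLY
# "+\.(jpg|png|gif)$" would filter out the seeds and reproduce the
# "0 records selected" behavior above.
+.
```

Rules are evaluated top to bottom and the first match wins, so the catch-all `+.` must come last.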

