Can you provide your regex-urlfilter and suffix-urlfilter files?
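
In the meantime, one thing worth checking: for an image-only crawl, the regex-urlfilter.txt still has to accept the HTML pages that link to the images; if the filters reject everything except image extensions, the seed pages themselves are filtered out, which would explain the "0 records selected for fetching" output below. A minimal sketch of what such a file might look like (the exact extension list here is an assumption, and note that the stock regex-urlfilter.txt shipped with Nutch contains a rule that *skips* image suffixes, which must be removed):

```
# regex-urlfilter.txt -- rules are applied top-down, first match wins.
# Skip non-HTTP protocols and formats we never want.
-^(ftp|mailto):
-\.(?i)(css|js|pdf|zip|gz|exe|mov|mp3|mp4)$
# Accept image documents (case-insensitive).
+\.(?i)(gif|jpg|jpeg|png|bmp|tiff?)$
# Accept everything else (HTML pages) so links to images can be discovered;
# tighten this to your seed domains if the crawl gets too broad.
+.
```

Filtering down to image documents only would then be done at indexing time rather than at fetch time, since the HTML pages are needed for link discovery.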

On Wed, Mar 6, 2013 at 7:58 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:

> Hi all.
> I am trying to restrict Nutch to crawl image documents only. I have used a
> suffix-urlfilter.txt to exclude some extensions I don't need, and also a
> regex-urlfilter.txt to allow image documents, but Nutch doesn't generate
> any URLs to fetch. Any suggestion on configuring Nutch to crawl image
> documents only would be appreciated.
> I am using Nutch 1.4 and Solr 3.6 in single mode with:
> bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr
> http://localhost:8080/solr/images
>
> My seed.txt has 19 URLs, and this is my console output:
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 20
> depth = 10
> solrUrl=http://localhost:8080/solr/images
> topN = 1000
> Injector: starting at 2013-03-06 10:41:33
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
> Generator: starting at 2013-03-06 10:41:36
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
>
