If you just want to crawl images and don't want any HTML pages, add rules to regex-urlfilter.txt so that it accepts only image URLs (jpg, gif, png, ico, bmp) and rejects everything else. Remove all the existing rules from the file and add these:
+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
-.

Thanks,
Tejas Patil

On Fri, Jan 18, 2013 at 10:43 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:
> Hi all.
>
> I'm trying to crawl image documents only (jpg, gif, png, ico, bmp),
> but unfortunately some HTML pages are included in my index too. I have used
> suffix-urlfilter.txt to restrict .html, .php and .xml, but some HTML pages
> have no extension and these are being inserted into my Solr index. I have
> also tried rejecting everything in regex-urlfilter.txt and permitting only
> these image types, but then Nutch says there are no documents to fetch.
> I'm using Nutch 1.4 and Solr 3.6.
> Can anybody help me or point me in the right direction to crawl only the
> documents that I want?
> Thanks in advance.
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
> INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
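
To check how the two rules from the reply treat a given URL before running a crawl, a rough Python sketch like the one below may help. It is not Nutch code. It assumes that the regex filter tries rules top to bottom, lets the first matching rule decide, uses "find"-style matching, and rejects a URL that no rule matches. The helper names (parse_rules, accepts) and the example.com URLs are made up for illustration.

```python
import re

# The two rules from the reply, in the order they appear in the file.
RULES_TEXT = r"""
+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed: a URL matched by no rule is rejected


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    # Hypothetical URLs, used only to exercise the rules.
    for url in (
        "http://example.com/logo.png",
        "http://example.com/index.html",
        "http://example.com/about",
        "http://example.com/photo.jpg?size=large",
    ):
        print("accept" if accepts(url, rules) else "reject", url)
```

Because the first pattern ends in $, an image URL with a query string after the extension falls through to the "-." rule and is rejected, and so does any extension missing from the list, such as .jpeg.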

