If you just want to crawl images and don't want any HTML pages, add
rules to regex-urlfilter.txt
so that it accepts only image URLs (jpg / gif / png / ico / bmp) and
rejects the rest.
Remove all the existing rules from the file and add this:

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

-.
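
To see how these two rules sort URLs, here is a rough Python sketch of the evaluation. It assumes the filter applies rules in file order, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The helper name `accepts` and the example.com URLs are made up for illustration and are not part of Nutch.

```python
import re

# The two rules above, in file order: (accept?, pattern).
# "+" lines become True, "-" lines become False.
RULES = [
    (True, re.compile(r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$")),
    (False, re.compile(r".")),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is an accept rule.

    Assumption: first match wins, and a URL that matches no rule is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    # Hypothetical URLs, only for demonstration.
    for url in (
        "http://example.com/logo.png",
        "http://example.com/about",
        "http://example.com/pic.jpg?size=2",
    ):
        print(url, "->", "accept" if accepts(url) else "reject")
```

Because the image pattern is anchored with `$`, a URL with a query string after the extension falls through to the `-.` rule. An extensionless HTML page, including a seed page, is rejected the same way, which may be related to the "nothing to fetch" symptom described in the quoted message.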


Thanks,

Tejas Patil



On Fri, Jan 18, 2013 at 10:43 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:

> Hi all,
>
> I'm trying to crawl image documents only (jpg, gif, png, ico, bmp), but
> unfortunately some HTML pages are ending up in my index too. I used
> suffix-urlfilter.txt to exclude .html, .php, and .xml, but some HTML
> pages have no extension, and these are being inserted into my Solr
> index. I also tried rejecting everything in regex-urlfilter.txt and
> permitting only these image types, but then Nutch said there were no
> documents to fetch. I'm using Nutch 1.4 and Solr 3.6.
> Can anybody help me or point me to the correct way to crawl only the
> documents that I want?
> Thanks in advance.