Hi Eyeris,

First of all, you need to check in your nutch-default.xml which plugins are configured for <name>plugin.includes</name>. In my crawler I configured the following URL filters:

    <value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which simply means: first use urlfilter-domain, second use urlfilter-domainblacklist, and third use urlfilter-regex. If you want to change this simple 'take one after the other' order, you can set the configuration entry <name>urlfilter.order</name> to the fully qualified URL filter class names, separated by a blank.

Looking into the configuration, you will find for each of the URL filters mentioned above a configuration entry that tells the filter where to find its configuration file. The configured defaults are domainblacklist-urlfilter.txt (for urlfilter-domainblacklist), domain-urlfilter.txt (for urlfilter-domain), prefix-urlfilter.txt (for urlfilter-prefix), regex-urlfilter.txt (for urlfilter-regex), suffix-urlfilter.txt (for urlfilter-suffix), and maybe others.

The overall rule for URL filtering is: the first positive match breaks the chain! For this reason I configured the ordering above. I think the best thing is to read the source code of the individual urlfilter plugins directly.

Just as a sketch: urlfilter-domain simply takes domains like de, at, ch, each on a separate line, which means only URLs from these domains pass the filter. urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', which means: don't let URLs from these domains pass the filter. urlfilter-regex takes regular expressions, one per line. Remember that the first positive match lets the URL pass. If a URL reaches the end of this filter without a positive match, the URL is disregarded!

In your regex-urlfilter.txt file the entry

    +\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

looks correct in telling Nutch to include all URLs ending with one of the given suffixes. You could have omitted the line that follows it, because all URLs reaching the end without a positive match against a regular expression will be skipped anyway. Furthermore, the image file suffixes you are interested in are correctly omitted from your urlfilter-suffix configuration. I think the filter configuration you previously sent should work.

Are your URL filters correctly configured in your nutch-default.xml? Can you please provide more information about that? I am also using version 1.5.1 at the moment; I included your regex into my configuration and a given jpg image was fetched! How do you check whether images are fetched?
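
To make the ordering and the filter files a bit more concrete, here is a minimal sketch. The property would go into nutch-site.xml (or nutch-default.xml); the class names are my assumption based on the stock plugins, so please check each plugin's source (or its plugin.xml) for the exact fully qualified names, and take the file contents only as illustrations of the rules described above:

    <!-- nutch-site.xml (sketch): apply urlfilter-domain before urlfilter-regex;
         class names assumed from the stock plugins, verify against the plugin sources -->
    <property>
      <name>urlfilter.order</name>
      <value>org.apache.nutch.urlfilter.domain.DomainURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
    </property>

    # domain-urlfilter.txt (sketch): one domain or suffix per line;
    # only URLs from these domains pass the filter
    de
    at
    ch

    # regex-urlfilter.txt (sketch): rules are checked top-down, the first match wins;
    # '+' lets the URL pass, '-' rejects it. URLs that match no rule at all are
    # dropped, so no trailing catch-all rule is needed.
    +\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$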
Cheers,
Walter


On 08.03.2013 17:22, Eyeris Rodriguez Rueda wrote:
> Hi all.
>
> Tejas.
> I'm trying to switch Nutch to 1.5.1 and not use 1.4 anymore for images. I need
> an explanation of how URL filters work in Nutch and how to avoid collisions
> between rules in the regex urlfilter files.
>
> ----- Original Message -----
> From: "Eyeris Rodriguez Rueda" <[email protected]>
> To: [email protected]
> Sent: Thursday, March 7, 2013 9:31:22
> Subject: Re: image crawling with nutch
>
> Thanks, Tejas, for your reply. Last month I was asking about a similar topic and
> you answered me with a recommendation that I implemented in regex-urlfilter.txt. As you
> can see, I have tried to crawl only
> images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me
> that there are no URLs to fetch and I don't understand why this is happening.
>
>

--
--------------------------------
Walter Tietze

Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318
[email protected]

http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------

