Hi Eyeris,

First of all, you need to check in your nutch-default.xml which plugins are
configured in

<name>plugin.includes</name>.

In my crawler I configured the following urlfilters

<value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which simply means: first use urlfilter-domain, second use
urlfilter-domainblacklist, and third use urlfilter-regex!
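Just to make the shape explicit, the complete entry in nutch-default.xml (or overridden in nutch-site.xml) looks roughly like this (a sketch, with the description text shortened):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>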


If you want to change the simple 'take one after the other' order, you can set
the configuration entry

<name>urlfilter.order</name>

by listing the fully qualified urlfilter class names separated by blanks.
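For example (just a sketch; I typed the class names from memory, so please double-check them against the urlfilter plugin sources of your Nutch version):

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.domain.DomainURLFilter org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
</property>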



Looking into the configuration, you will find for each of the urlfilters
mentioned above a configuration entry that tells the filter where to find its
configuration file (see the example entries after the list below).



The configured default values are the files

domainblacklist-urlfilter.txt (for urlfilter-domainblacklist),
domain-urlfilter.txt (for urlfilter-domain),
prefix-urlfilter.txt (for urlfilter-prefix),
regex-urlfilter.txt (for urlfilter-regex),
suffix-urlfilter.txt (for urlfilter-suffix),

and maybe others.
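For instance, the entries for the domain and regex filters look roughly like this in my nutch-default.xml (property names typed from memory, please verify them in your copy):

<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>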



The overall rule for URL filtering is: the first positive match breaks the chain!
For this reason I configured the ordering above.

I think the best thing is to read the source code of the individual urlfilter
plugins directly.




Just as a sketch:

urlfilter-domain simply takes domains like de, at, ch, each on a separate line,
which means only URLs from these domains pass the filter!
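For example, a minimal domain-urlfilter.txt would just contain

de
at
ch

one entry per line, nothing else.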

urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', which
means: don't let URLs from these domains pass the filter.
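So a domainblacklist-urlfilter.txt could look like

www.idontwantthissite.de

again one domain per line (the entry above is of course just a placeholder).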

urlfilter-regex takes regular expressions, one per line. Remember that the first
positive match lets the URL pass. If a URL reaches the end of this filter without
a positive match, it is discarded!
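As a tiny sketch of the first-match behaviour, a regex-urlfilter.txt like

# reject URLs carrying a session id, even though they would match the rule below
-[?&]sessionid=
# accept everything else that starts with http or https
+^https?://

rejects any URL containing 'sessionid=' because the minus rule is hit first,
accepts all other http(s) URLs, and discards everything that matches no rule at all.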



In your regex-urlfilter.txt file the entry

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

looks correct: it includes all URLs ending with one of the given suffixes.

You could have omitted the following line, because all URLs reaching the end of
the file without a positive match by a regular expression will be skipped anyway.
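So for your image-only case, the whole regex-urlfilter.txt could be reduced to (a sketch, based on the rule you sent):

# accept only URLs ending with one of the image suffixes
+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
# no final catch-all rule needed: anything without a positive match is skipped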



Furthermore, the image file suffixes you are interested in are correctly omitted
from your urlfilter-suffix configuration.


I think the filter configuration you previously sent should work.


Are your urlfilters correctly configured in your nutch-default.xml?


Can you please provide more information about that?


I am also using version 1.5.1 at the moment; I included your regex in my
configuration and a given jpg image was fetched!


How do you check if images are fetched?



Cheers, Walter




On 08.03.2013 17:22, Eyeris Rodriguez Rueda wrote:
> Hi all.
> 
> Tejas.
> I'm trying to switch Nutch to 1.5.1 and not use 1.4 anymore for images. I need
> an explanation of how URL filters work in Nutch and how to avoid collisions
> between rules in the regex urlfilter files.
> 
> ----- Original Message -----
> From: "Eyeris Rodriguez Rueda" <[email protected]>
> To: [email protected]
> Sent: Thursday, 7 March 2013 9:31:22
> Subject: Re: image crawling with nutch
> 
> Thanks Tejas for your reply. Last month I was asking about a similar topic and
> you answered me with a recommendation that I implemented in regex-urlfilter.txt.
> As you can see, I have tried to crawl only
> images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$) but Nutch is telling me
> that there are no URLs to fetch and I don't understand why this is happening.
> 


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------
