On 08.03.2013 20:23, Eyeris Rodriguez Rueda wrote:
> Thanks a lot, Walter, for your time. I'm new to Nutch.
> I really appreciate your reply, it was very helpful for me.
> 
> So, for better understanding:
> 
> Using urlfilter-domain I can specify which domains are allowed,
> and urlfilter-domainblacklist is to restrict domains?
> 

Exactly.

You might want to add 'cu' or 'www.uci.cu' to a urlfilter-domain configuration.

I am not quite sure, but I think 'uci.cu' alone won't match against the 
getDomainName() or getHost() methods.

If you just add the top-level domain, you will have to use further regular 
expressions.
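Just as a sketch, a domain-urlfilter.txt for your case could start like this 
(one entry per line; these entries are examples, not a tested configuration):

# allow the whole top-level domain
cu
# or, more specifically, a known host
www.uci.cu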




> urlfilter-suffix restricts only by document extension? For example, if I 
> have a URL like
> http://host.domain.country/image.jpg and I have included .jpg in 
> suffix-urlfilter.txt, will this URL be skipped?
> 


Depends. Please read the comment at the top of the class 
org.apache.nutch.urlfilter.suffix.SuffixURLFilter:


--------------------------------------- SNIP -------------------------------------------

 * <p>This filter can be configured to work in one of two modes:
 * <ul>
 * <li><b>default to reject</b> ('-'): in this mode, only URLs that match suffixes
 * specified in the config file will be accepted, all other URLs will be
 * rejected.</li>
 * <li><b>default to accept</b> ('+'): in this mode, only URLs that match suffixes
 * specified in the config file will be rejected, all other URLs will be
 * accepted.</li>
 * </ul>
 * <p>
 * The format of this config file is one URL suffix per line, with no preceding
 * whitespace. Order, in which suffixes are specified, doesn't matter. Blank
 * lines and comments (#) are allowed.
 * </p>
 * <p>
 * A single '+' or '-' sign not followed by any suffix must be used once, to
 * signify the mode this plugin operates in. An optional single 'I' can be appended,
 * to signify that suffix matches should be case-insensitive. The default, if
 * not specified, is to use case-sensitive matches, i.e. suffix '.JPG'
 * does not match '.jpg'.
 * </p>
 * <p>
 * NOTE: the format of this file is different from urlfilter-prefix, because
 * that plugin doesn't support allowed/prohibited prefixes (only supports
 * allowed prefixes). Please note that this plugin does not support regular
 * expressions, it only accepts literal suffixes. I.e. a suffix "+*.jpg" is most
 * probably wrong, you should use "+.jpg" instead.

--------------------------------------- SNAP -------------------------------------------

Your file starts with


# case-insensitive, allow unknown suffixes
+I
# uncomment the line below to filter on url path
+P


which means: treat URLs case-insensitively and check the suffixes against the 
URL path.

The second plus is significant: it means that only 'URLs that match suffixes 
specified in the config file will be rejected, all other URLs will be accepted'.

In your case this means that all URLs ending with one of the suffixes listed in 
suffix-urlfilter.txt will be rejected.
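
Just to illustrate the '+' mode with an example (the suffixes here are made 
up): a file like

+I
# these suffixes will be rejected, everything else is accepted
.exe
.zip

would reject URLs ending in .exe or .zip, case-insensitively, and accept all 
other URLs.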


I think for your case it would be better to define some simple regexes in the 
regex-urlfilter.txt file.




Did you already consider defining a regex that allows all pictures from your site?


Something like

+^http://([a-z0-9]*\.)*uci\.cu/.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$


This creates a single Java Pattern to check against, which should be pretty fast!
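
If you want to sanity-check such an expression outside of Nutch, you can test 
it with a few lines of Java (the sample URL below is made up):

import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // the same expression as in regex-urlfilter.txt, without the leading '+'
        Pattern p = Pattern.compile(
                "^http://([a-z0-9]*\\.)*uci\\.cu/.*\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$");
        // hypothetical image URL, only for illustration
        System.out.println(p.matcher("http://www.uci.cu/images/logo.jpg").matches()); // prints true
    }
}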




> What happens if I include .html in suffix-urlfilter.txt? Because I don't want 
> to index HTML documents in Solr, but they are important to discover links 
> to other images.
> 

I think you have to include a regex that accepts all HTML pages from your site, 
otherwise you will not be able to discover the images!


## Matches all pages in subdomains of your site !!! Prefix match !!!

+^http://([a-z0-9]*\.)*uci\.cu/
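
Putting it together, a minimal regex-urlfilter.txt for your case could look 
like this (just a sketch, untested):

# accept every URL on uci.cu, so HTML pages are fetched and their links followed
+^http://([a-z0-9]*\.)*uci\.cu/
# reject everything else (optional, URLs without a positive match are skipped anyway)
-.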



> I want to crawl all images from the uci.cu domain only.
> 

Cheers, Walter


> ----- Original Message -----
> From: "Walter Tietze" <[email protected]>
> To: [email protected]
> Sent: Friday, March 8, 2013 13:22:25
> Subject: Re: image crawling with nutch
> 
> 
> Hi Eyeris,
> 
> 
> first of all you need to check in your nutch-default.xml which plugins are 
> configured for
> 
> <name>plugin.includes</name> .
> 
> In my crawler I configured the following urlfilters
> 
> <value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
> which simply means: first use urlfilter-domain, then use 
> urlfilter-domainblacklist, and finally use urlfilter-regex!
> 
> 
> If you want to change the simple 'take one after the other' order, you can set 
> the configuration entry
> 
> <name>urlfilter.order</name> by listing the fully qualified urlfilter class 
> names separated by blanks.
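> 
> Just as an illustration, such an entry could look like this (the class names 
> are from my memory of the plugin sources, please verify them in your version):
> 
> <property>
>   <name>urlfilter.order</name>
>   <value>org.apache.nutch.urlfilter.domain.DomainURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
> </property>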
> 
> 
> 
> Looking into the configuration, you will find, for each of the urlfilters 
> mentioned above, configuration entries that tell the filters where to find 
> their configuration files.
> 
> 
> 
> The configured default values are the files
> 
> domainblacklist-urlfilter.txt (for urlfilter-domainblacklist),
> 
> domain-urlfilter.txt ( for urlfilter-domain),
> 
> prefix-urlfilter.txt (for urlfilter-prefix),
> 
> regex-urlfilter.txt (for urlfilter-regex)
> 
> suffix-urlfilter.txt (for urlfilter-suffix) and maybe others.
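> 
> For example, the entry pointing urlfilter-regex to its file looks roughly 
> like this in nutch-default.xml (paraphrased, check your copy):
> 
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>regex-urlfilter.txt</value>
> </property>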
> 
> 
> 
> The overall rule for URL filtering is: the first positive match breaks the chain! 
> For this reason I configured the ordering above.
> 
> I think the best thing is to read the source code of the several urlfilter 
> plugins directly.
> 
> 
> 
> 
> Just as a sketch:
> 
> urlfilter-domain simply takes domains like de, at, ch, each on a separate 
> line, which means only URLs within these domains pass the filter!
> 
> urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', 
> which means: don't let URLs from these domains pass the filter.
> 
> urlfilter-regex takes regular expressions, one per line. Remember that the 
> first positive match lets the URL pass. If a URL reaches the end of this filter 
> without a positive match, it is disregarded!
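> 
> A small example of why the order of the rules matters (the rules and the host 
> are made up):
> 
> # reject URLs with query parameters; must come before the broad accept
> -[?&=]
> +^http://([a-z0-9]*\.)*example\.com/
> 
> If the two lines were swapped, the broad accept would match first and the 
> query URLs would pass anyway.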
> 
> 
> 
> In your regex-urlfilter.txt file the entry
> 
> +\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
> 
> looks correct in telling Nutch to include all URLs ending with one of the given 
> suffixes.
> 
> You could have omitted the following line, because all URLs reaching the end 
> without a positive match by a regular expression will be skipped anyway.
> 
> 
> 
> Furthermore, the image file suffixes you are interested in are correctly 
> omitted from your urlfilter-suffix configuration.
> 
> 
> I think the filter configuration you previously sent should work.
> 
> 
> Are your urlfilters correctly configured in your nutch-default.xml?
> 
> 
> Can you please provide more information about that?
> 
> 
> I am also using version 1.5.1 at the moment; I included your regex in my 
> configuration and a given jpg image was fetched!
> 
> 
> How do you check if images are fetched?
> 
> 
> 
> Cheers, Walter
> 
> 
> 
> 
> On 08.03.2013 17:22, Eyeris Rodriguez Rueda wrote:
>> Hi all.
>>
>> Tejas,
>> I'm switching Nutch to 1.5.1 and will not use 1.4 anymore for images. I need 
>> an explanation of how URL filters work in Nutch and how to avoid 
>> collisions between rules in the regex urlfilter files.
>>
>> ----- Original Message -----
>> From: "Eyeris Rodriguez Rueda" <[email protected]>
>> To: [email protected]
>> Sent: Thursday, March 7, 2013 9:31:22
>> Subject: Re: image crawling with nutch
>>
>> Thanks, Tejas, for your reply. Last month I was asking about a similar topic, 
>> and you answered with a recommendation that I implemented in regex-urlfilter.txt. 
>> As you can see, I have tried to crawl only 
>> images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me 
>> that there are no URLs to fetch and I don't understand why this is happening.


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------
