The situation with the 404 errors forced me to find a solution for sending
the Referer header.
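
A minimal sketch of the idea from Markus's quoted message below (remember
the page on which each outlink was found, then send that page as the Referer
header at fetch time), in plain Java. This is not Nutch code and not a
plugin: the class RefererSketch, its methods, and the in-memory map standing
in for CrawlDatum's MetaData are all hypothetical names used only for
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
final class RefererSketch {
    // Assumption: stands in for per-URL metadata such as CrawlDatum's MetaData.
    private final Map<String, String> referrerByUrl = new HashMap<>();

    // "Parse" step: remember the first page on which each outlink was seen.
    // A page may be linked from many places, so only one referrer is kept.
    void recordOutlinks(String pageUrl, Iterable<String> outlinks) {
        for (String outlink : outlinks) {
            referrerByUrl.putIfAbsent(outlink, pageUrl);
        }
    }

    // "Fetch" step: prepare a request that carries the Referer header when a
    // referrer is known. Nothing is sent until the caller connects.
    HttpURLConnection prepareRequest(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            String referrer = referrerByUrl.get(url);
            if (referrer != null) {
                conn.setRequestProperty("Referer", referrer);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this in place, the admin of the remote site would see in
his access log which page pointed the crawler at a 404 URL. Keeping only the
first referrer seen is a simplification, since there may not be one unique
referrer per page.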


Markus Jelsma-2 wrote
> 
> If you are looking for inlinks to 404 URLs but cannot find them in the
> LinkDB, it sounds like you should check the db.ignore.* configuration
> directives. IIRC the LinkDB will not populate internal links.
>  
>  
> -----Original message-----
>> From:SebaZ <sebastian.zaborowski@>
>> Sent: Wed 20-Jun-2012 16:01
>> To: [email protected]
>> Subject: RE: HTTP REFERER is missing
>> 
>> 
>> Markus Jelsma-2 wrote
>> > 
>> > Nutch cannot do this by default and is tricky to make because there may
>> > not be one unique referrer per page.
>> > 
>> I don't really need a unique referrer. All I want is to tell the requested
>> server on which URL the crawler found the link.
>> 
>> The admin of one site informed me that his logs show a lot of 404 errors
>> caused by my search server. The crawler is opening weird URLs like
>> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A but it should be
>> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf, without *;O=A*. I
>> searched the LinkDB and it has no information about either the good or the
>> bad URL. Without the Referer header I can't find which page contains the
>> wrong link or the code that points to the wrong URLs.
>> 
>> 
>> 
>> Markus Jelsma-2 wrote
>> > 
>> > What you can try is to add the referrer to outlinks when parsing records.
>> > This outlink can be added to CrawlDatum's MetaData which you can then
>> > later use to set the referrer. To set the referrer you must hack
>> Can you help me with it a little bit? Can I do it in the Nutch
>> configuration? I am also not good at Java programming; I'm using Nutch as a
>> crawler app only. I tried to find the exact file/code where I can change
>> it (the http plugin) but I didn't find any solution.
>> 
>> 
>> Regards
>> SZ
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990533.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990666.html
Sent from the Nutch - User mailing list archive at Nabble.com.
