If you are looking for inlinks to the 404 URLs but cannot find them in the LinkDB,
you should check the db.ignore.* configuration directives. IIRC the LinkDB does not
store internal (same-host) links by default.
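
As a rough sketch, assuming Nutch 1.x with local overrides kept in conf/nutch-site.xml, keeping same-host links in the LinkDB might look like the fragment below. The property name is the one listed in nutch-default.xml; check its default and description in your own version before relying on it. If your nutch-site.xml already has a configuration element, only the property block goes inside it.

```xml
<!-- conf/nutch-site.xml: assumed override, verify against your nutch-default.xml -->
<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <!-- false = do not drop links between pages on the same host -->
    <value>false</value>
  </property>
</configuration>
```

The LinkDB would have to be rebuilt (bin/nutch invertlinks) after this change before inlinks for the bad URL can show up in bin/nutch readlinkdb output.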
 
 
-----Original message-----
> From:SebaZ <[email protected]>
> Sent: Wed 20-Jun-2012 16:01
> To: [email protected]
> Subject: RE: HTTP REFERER is missing
> 
> 
> Markus Jelsma-2 wrote
> > 
> > Nutch cannot do this by default and is tricky to make because there may
> > not be one unique referrer per page.
> > 
> I don't really need a unique referrer. All I want is to tell the requested
> server on which URL the crawler found the link.
> 
> There is a site whose admin informed me that he sees a lot of 404 errors
> in his logs from my search server. The crawler is opening weird URLs like
> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A but it should be
> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf, without *;O=A*. I
> searched the linkdb and it doesn't have any information about either the
> good or the bad URL. Without a referrer I can't find which site has the
> wrong link or the code directing to the wrong URLs.
> 
> 
> 
> Markus Jelsma-2 wrote
> > 
> > What you can try is to add the referrer to outlinks when parsing records.
> > This outlink can be added to CrawlDatum's MetaData which you can then
> > later use to set the referrer. To set the referrer you must hack
> Can you help me with it a little bit? Can I do it in the Nutch configuration?
> I am also not good at Java programming. I'm using Nutch as a crawler app
> only. I was trying to find the exact file/code where I can change it (http
> plugin) but I didn't find any solution.
> 
> 
> Regards
> SZ
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990533.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
