The situation with the 404 errors forced me to find a way to send the Referer header.
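
For anyone following the thread, here is a minimal sketch of the idea Markus describes in the quoted message: remember, for each outlink, the page it was found on, and send that page as the Referer when the outlink is fetched. It is plain Java and not Nutch code. The class and method names (ReferrerSketch, recordReferrers, openWithReferer) and the referring page URL are made up for illustration. In Nutch itself the mapping would presumably live in the CrawlDatum metadata and the header would be set inside the HTTP protocol plugin, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; none of these names are Nutch APIs.
class ReferrerSketch {

    // Parse step: remember the page on which each outlink was first seen.
    // There may be several referring pages; this keeps only the first one.
    static void recordReferrers(String pageUrl, List<String> outlinks,
                                Map<String, String> referrers) {
        for (String outlink : outlinks) {
            referrers.putIfAbsent(outlink, pageUrl);
        }
    }

    // Fetch step: prepare a request and attach the stored referrer, if any.
    // openConnection() does no network I/O; that only happens on connect().
    static HttpURLConnection openWithReferer(String url,
                                             Map<String, String> referrers) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            String referrer = referrers.get(url);
            if (referrer != null) {
                // The HTTP header name is spelled "Referer".
                conn.setRequestProperty("Referer", referrer);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> referrers = new HashMap<>();
        // Hypothetical referring page; the outlink is the bad URL from the thread.
        String page = "http://www.domain.com/~tdz/sbd/";
        String bad = "http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A";
        recordReferrers(page, List.of(bad), referrers);

        HttpURLConnection conn = openWithReferer(bad, referrers);
        System.out.println("Referer: " + conn.getRequestProperty("Referer"));
    }
}
```

With the header in place, the admin of the remote site can see in his access log which page produced the request for the ;O=A URL.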

Markus Jelsma-2 wrote
>
> If you are looking for inlinks to 404 URLs but cannot find them in the
> LinkDB, it sounds like you should check the db.ignore.* configuration
> directives. IIRC the LinkDB will not populate internal links.
>
> -----Original message-----
>> From: SebaZ <sebastian.zaborowski@>
>> Sent: Wed 20-Jun-2012 16:01
>> To: [email protected]
>> Subject: RE: HTTP REFERER is missing
>>
>> Markus Jelsma-2 wrote
>> >
>> > Nutch cannot do this by default and is tricky to make because there may
>> > not be one unique referrer per page.
>> >
>> I don't really need a unique referrer. All I want is to inform the
>> requested server on which URL the crawler found the link.
>>
>> There is a site whose admin informed me that he has a lot of 404 errors
>> in his logs from my search server. The crawler is opening weird URLs like
>> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A but it should be
>> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf, without *;O=A*.
>> I searched the LinkDB and it doesn't have any information about either
>> the good or the bad URL. Without the Referrer I can't find which site has
>> the wrong link or the code directing to the wrong URLs.
>>
>> Markus Jelsma-2 wrote
>> >
>> > What you can try is to add the referrer to outlinks when parsing records.
>> > This outlink can be added to CrawlDatum's MetaData which you can then
>> > later use to set the referrer. To set the referrer you must hack
>> Can you help me with it a little bit? Can I do it in the Nutch
>> configuration? I am also not good at Java programming. I'm using Nutch
>> as a crawler app only. I was trying to find the exact file/code where I
>> can change it (http plugin) but I didn't find any solution.
>>
>> Regards
>> SZ
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990533.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990666.html
Sent from the Nutch - User mailing list archive at Nabble.com.

