On Wednesday 21 September 2011 17:21:43 Oleg Mürk wrote:
> Hello,
>
> When I fetch the following links with nutch 1.3:
> http://blog.mises.org/archives/010450.asp
> http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> and
> http.redirect.max = 2
> The first of these links is fetched OK, including the two redirects:
> http://blog.mises.org/?p=010450
> http://blog.mises.org/10450/what-the-bubble-did-to-technology/
> However for the second link (feedproxy.google.com) the redirects are
> not being followed during the fetch.
> Both redirects are "301 Moved Permanently".
>
> May be somebody could suggest what is causing such behavior? I am
> using the default settings + http.agent.name and http.robots.agents.
>
> Further, if I update the crawldb with the results of the fetch and
> then generate a new segment, the link
> http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29
> which is redirected from
> http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> is never added to the new segment.

Check your URL filters. It's most likely thrown away.

> What am I doing wrong? :)
>
> Thank You!
> Oleg Mürk

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
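
As a rough illustration of the URL-filter suggestion above, here is a minimal Python sketch of how a first-match regex filter list would treat the redirect target. It is not Nutch code. The rule list is an assumption modelled on the style of conf/regex-urlfilter.txt ("-" rejects, "+" accepts, first matching rule wins), and the "-[?*!@=]" rule is assumed to be active; check your own conf directory to see which rules actually apply. The names RULES and accepts are hypothetical.

```python
import re

# Hypothetical rule list in the style of a regex URL filter file:
# '-' rejects, '+' accepts, and the first rule whose pattern matches decides.
# Whether a rule like "-[?*!@=]" is active in a given setup is an assumption.
RULES = [
    ("-", r"[?*!@=]"),  # skip URLs that look like queries
    ("+", r"."),        # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


feed_url = (
    "http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/"
    "google_docs_goes_back_to_schoo.php"
)
target_url = (
    "http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php"
    "?utm_source=feedburner&utm_medium=feed"
    "&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29"
)

print(accepts(feed_url))    # no '?', '*', '!', '@' or '=' in this URL
print(accepts(target_url))  # contains '?' and '=', so the reject rule matches
```

Under these assumed rules the feedproxy URL passes and the readwriteweb.com target is dropped because of its query string, which would be consistent with the link never reaching a new segment. If that matches your configuration, relaxing or reordering the rule is one option to try.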

