On Wednesday 21 September 2011 17:21:43 Oleg Mürk wrote:
> Hello,
> 
> When I fetch the following links with nutch 1.3:
>   http://blog.mises.org/archives/010450.asp
>   http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> and
>   http.redirect.max = 2
> The first of these links is fetched OK, including the two redirects:
>   http://blog.mises.org/?p=010450
>   http://blog.mises.org/10450/what-the-bubble-did-to-technology/
> However for the second link (feedproxy.google.com) the redirects are
> not being followed during the fetch.
> Both redirects are "301 Moved Permanently".
> 
> Maybe somebody could suggest what is causing this behavior? I am
> using the default settings + http.agent.name and http.robots.agents.
> 
> Further, if I update the crawldb with the results of the fetch and
> then generate a new segment, the link
>   http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29
> which is redirected from
>   http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> is never added to the new segment.

Check your URL filters. The redirect target is most likely being thrown away there.
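
To see why a filter might drop that URL, here is a rough Python sketch of how a regex URL filter evaluates its rules: the first rule whose pattern is found in the URL decides, and a URL that matches nothing is rejected. The rule list is an assumption modelled on the stock conf/regex-urlfilter.txt, which skips URLs containing characters such as ? and =; check your own copy. The names parse_rules and url_filter are made up for illustration and are not Nutch API.

```python
import re

# Assumed excerpt of conf/regex-urlfilter.txt; your file may differ.
RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def url_filter(url, rules):
    """Return the URL if accepted, or None if rejected (first match wins)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None  # no rule matched: reject

rules = parse_rules(RULES)
source = ("http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/"
          "google_docs_goes_back_to_schoo.php")
target = ("http://www.readwriteweb.com/archives/"
          "google_docs_goes_back_to_schoo.php?utm_source=feedburner"
          "&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29")

for url in (source, target):
    print("kept" if url_filter(url, rules) else "dropped", url)
```

Under these assumed rules the feedproxy URL passes, while the redirect target is rejected because of its query string. If that matches your configuration, relaxing or removing the -[?*!@=] rule should let the target through.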

> 
> What am I doing wrong? :)
> 
> Thank You!
> Oleg Mürk

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
