Re: http.redirect.max

2012-03-02 Thread Lewis John Mcgibbney
Sent: Fri, Feb 24, 2012 1:31 am Subject: Re: http.redirect.max The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore any incorrect parts. Yes, from my end I can see the crawl for website http://www.scotland.gov.uk is redirected

Re: http.redirect.max

2012-03-01 Thread alxsss
-----Original Message----- From: xuyuanme xuyua...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 24, 2012 1:31 am Subject: Re: http.redirect.max The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore any incorrect parts. Yes, from my end I can

Re: http.redirect.max

2012-02-24 Thread xuyuanme
The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore any incorrect parts. Yes, from my end I can see the crawl for website http://www.scotland.gov.uk is redirected as expected. However, the website I tried to crawl is a bit more tricky. Here's

Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
Hi, can you post your nutch-site.xml and I will give it a spin? Thank you, Lewis. On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote: Just checked the latest code in 1.4 but it's the same. See line 138 of the code at the link below:

Re: http.redirect.max

2012-02-23 Thread xuyuanme
Thanks! The config file can be found here: http://dl.dropbox.com/u/6614015/temp/config.zip lewis john mcgibbney wrote: Hi, can you post your nutch-site.xml and I will give it a spin? Thank you, Lewis. On Thu, Feb 23, 2012 at 5:07 AM,

Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
I've tested redirects and everything seems to work fine for me. The site I checked, http://www.scotland.gov.uk, temp-redirects to http://home.scotland.gov.uk/home. Nutch gets this fine when I do some tweaking with nutch-site.xml redirects property -1 (just to demonstrate, I

Re: http.redirect.max

2012-02-22 Thread xuyuanme
Thanks for the information. But I found the wiki page http://wiki.apache.org/nutch/RedirectHandling still doesn't have much content about Nutch redirects. I found that even if I set http.redirect.max=2 and db.ignore.external.links=false, the crawler

Re: http.redirect.max

2012-02-22 Thread remi tassing
Would you give Nutch-1.4 a try? Maybe this bug is already solved? Remi. On Thursday, February 23, 2012, xuyuanme xuyua...@gmail.com wrote: Thanks for the information. But I found the wiki page http://wiki.apache.org/nutch/RedirectHandling still

Re: http.redirect.max

2012-02-22 Thread xuyuanme
Just checked the latest code in 1.4 but it's the same. See line 138 of the code at the link below: http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup

Re: http.redirect.max

2011-11-21 Thread Lewis John Mcgibbney
Hi Rafael, The page we are talking about will be added to the page linked below, http://wiki.apache.org/nutch/InternalDocumentation, and will be available here: http://wiki.apache.org/nutch/RedirectHandling I guess the poor documentation of nutch/hadoop is the biggest problem for beginners like me.

Re: http.redirect.max

2011-11-18 Thread Rafael Pappert
Thanks for updating the list. On 11/17/2011 02:52 PM, Rafael Pappert wrote: Hi, after some investigation I found the problem. I had db.ignore.external.links set to true, which is why the fetcher wasn't following the redirection from domain.com

Re: http.redirect.max

2011-11-18 Thread Rafael Pappert
Hi Lewis, The honest truth is that there needs to be comprehensive documentation on the wiki for the way that Nutch handles redirects. This is a question that has gone fully unanswered for some time. That's true. In the meantime, can you advise if there is anything over and above the

Re: http.redirect.max

2011-11-17 Thread Rafael Pappert
Hi, after some investigation I found the problem. I had db.ignore.external.links set to true, which is why the fetcher wasn't following the redirection from domain.com to www.domain.com. Rafael. On 16/Nov/2011, at 20:17, Rafael Pappert wrote: Hello List, is it possible to follow HTTP 301
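
The effect Rafael describes can be pictured with a small standalone sketch. It assumes that an "external" link is decided by a plain host-name comparison; that is an assumption made for illustration, not Nutch's actual code, and `isExternal` is a hypothetical helper. Under that assumption, domain.com and www.domain.com count as different hosts, so a redirect between them looks external and is dropped when external links are ignored.

```java
import java.net.URI;

public class RedirectHostCheck {

    // Hypothetical helper (not part of Nutch): a redirect counts as
    // "external" when source and target host names differ.
    public static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // Treat URLs without a host as external to stay on the safe side.
        if (fromHost == null || toHost == null) {
            return true;
        }
        return !fromHost.equalsIgnoreCase(toHost);
    }

    public static void main(String[] args) {
        // Different host names, so the redirect looks external: prints true
        System.out.println(isExternal("http://domain.com/", "http://www.domain.com/"));
        // Same host name, so the redirect is internal: prints false
        System.out.println(isExternal("http://www.domain.com/a", "http://www.domain.com/b"));
    }
}
```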

Re: http.redirect.max

2011-11-17 Thread Ferdy Galema
Thanks for updating the list. On 11/17/2011 02:52 PM, Rafael Pappert wrote: Hi, after some investigation I found the problem. I had db.ignore.external.links set to true, which is why the fetcher wasn't following the redirection from domain.com to www.domain.com. Rafael. On 16/Nov/2011, at 20:17,

Re: http.redirect.max and duplicate fetch/parse

2011-10-18 Thread Markus Jelsma
That sounds creepy indeed. It would still need a similar amount of RAM plus network overhead. Would a Bloom filter be useful at all? It takes a lot less space and I can live with a non-deterministic approach. On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote: Hi I think some
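
For readers unfamiliar with the idea, here is a minimal Bloom filter sketch for remembering which URLs have already been seen. It is not taken from Nutch; the class name `UrlBloomFilter`, the bit-array size, and the number of hash functions are all assumptions chosen for illustration. It shows the trade-off mentioned above: memory use is fixed, a URL that was added is always reported as possibly present, and a URL that was never added may occasionally be reported as present too.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter, must be > 0
    private final int hashCount;  // number of bit positions set per URL

    public UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Seeded 32-bit FNV-1a style hash over the URL bytes.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th position is derived from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x9747b28c);
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(data, i));
        }
    }

    // false: the URL was definitely never added.
    // true: the URL was probably added (false positives are possible).
    public boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 20, 4);
        seen.add("http://www.domain.com/");
        // An added URL is always reported as possibly present: prints true
        System.out.println(seen.mightContain("http://www.domain.com/"));
    }
}
```

With such a filter, a false positive would mean a redirect target is skipped even though it was never fetched, which is the non-deterministic behaviour the message says it can live with.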

Re: http.redirect.max and duplicate fetch/parse

2011-10-18 Thread Sergey A Volkov
Actually, some key-value stores use a Bloom filter for a similar purpose. What is your queue size? And what is the redirect rate? If most redirects are not cross-domain and the average number of URLs per domain is not very big, some fixed-size cache in FetchItemQueue may help. But this leads to lots of changes
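
The fixed-size cache idea can be sketched roughly as follows. This is a standalone illustration, not a patch to FetchItemQueue: `RecentUrlCache` and `markSeen` are hypothetical names, and the least-recently-used eviction policy is an assumption, since the message does not say which policy was intended.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RecentUrlCache {
    private final Set<String> recent;

    public RecentUrlCache(final int capacity) {
        // An access-ordered LinkedHashMap evicts the least recently used
        // entry once the cache grows beyond its capacity.
        this.recent = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > capacity;
                }
            });
    }

    // Returns true if the URL was not in the cache, i.e. it has not been
    // seen recently and would be worth fetching. Not thread-safe.
    public boolean markSeen(String url) {
        return recent.add(url);
    }

    public static void main(String[] args) {
        RecentUrlCache cache = new RecentUrlCache(2);
        System.out.println(cache.markSeen("http://www.domain.com/a")); // prints true
        System.out.println(cache.markSeen("http://www.domain.com/a")); // prints false
    }
}
```

One such cache per queue would keep memory bounded per domain, at the cost of forgetting URLs that fall out of the window, so a duplicate fetch is still possible for redirect targets seen long ago.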