Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max
The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore any incorrect parts.
Yes from my end I can see the crawl for website http://www.scotland.gov.uk
is redirected.
-Original Message-
From: xuyuanme xuyua...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max
The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore any incorrect parts.
Yes from my end I can
The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore any incorrect parts.
Yes from my end I can see the crawl for website http://www.scotland.gov.uk
is redirected as expected.
However the website I tried to crawl is a bit more tricky.
Here's
Hi,
Can you post your nutch-site.xml and I will give it a spin.
Thank you
Lewis
On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote:
Just checked the latest code in 1.4, but it's the same. See line 138 at the
link below:
Thanks! The config file can be downloaded here:
http://dl.dropbox.com/u/6614015/temp/config.zip
lewis john mcgibbney wrote
Hi,
Can you post your nutch-site.xml and I will give it a spin.
Thank you
Lewis
On Thu, Feb 23, 2012 at 5:07 AM,
I've checked working with redirects and everything seems to work fine for
me.
The site I checked,
http://www.scotland.gov.uk
issues a temporary redirect to
http://home.scotland.gov.uk/home
Nutch gets this fine when I do some tweaking with nutch-site.xml
redirects property -1 (just to demonstrate, I
Thanks for the information. But I found that the wiki page
http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
content about Nutch redirects.
I found even if I set http.redirect.max=2 and
db.ignore.external.links=false, the crawler
Would you give Nutch 1.4 a try? Maybe this bug has already been fixed.
Remi
On Thursday, February 23, 2012, xuyuanme xuyua...@gmail.com wrote:
Thanks for the information. But I found the wiki page
http://wiki.apache.org/nutch/RedirectHandling still
Just checked the latest code in 1.4, but it's the same. See line 138 at the
link below:
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
Hi Rafael,
The page we are talking about will be added on the link below.
http://wiki.apache.org/nutch/InternalDocumentation
and will be available here
http://wiki.apache.org/nutch/RedirectHandling
I guess the poor documentation of nutch/hadoop is the biggest problem for
beginners like me.
Thanks for updating the list.
On 11/17/2011 02:52 PM, Rafael Pappert wrote:
Hi,
After some investigation I found the problem.
I had db.ignore.external.links set to true, which is why
the fetcher wasn't following the redirection from domain.com
Hi Lewis,
The honest truth is that there needs to be comprehensive documentation on
the wiki for the way that Nutch handles redirects. This is a question that
has gone fully unanswered for some time.
That's true.
In the meantime, can you advise if there is anything over
and above the
Hi,
After some investigation I found the problem.
I had db.ignore.external.links set to true, which is why
the fetcher wasn't following the redirection from domain.com to
www.domain.com.
Rafael.
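
The interaction Rafael describes can be sketched in a few lines of Python. This is a simplified model for illustration only, not Nutch's fetcher code: the function name `follow_redirects`, the `redirects` lookup table, and the rule that "external" means "different host name" are all assumptions. It also covers only redirects followed immediately, up to a limit standing in for http.redirect.max.

```python
from urllib.parse import urlsplit


def follow_redirects(start_url, redirects, redirect_max, ignore_external_links):
    """Toy model of the two settings discussed in the thread (hypothetical helper).

    redirects: dict mapping a URL to its redirect target (made-up test data).
    Returns the last URL reached.
    """
    url = start_url
    # Follow at most redirect_max hops; zero or negative means none in this model.
    for _ in range(max(redirect_max, 0)):
        target = redirects.get(url)
        if target is None:
            break  # no further redirect
        # Assumption: a target on a different host counts as external,
        # so domain.com -> www.domain.com is treated as external.
        if ignore_external_links and urlsplit(target).hostname != urlsplit(url).hostname:
            break
        url = target
    return url
```

With a table such as `{"http://domain.com/": "http://www.domain.com/"}`, the model stays on `http://domain.com/` when `ignore_external_links` is true and reaches `http://www.domain.com/` when it is false, which matches the symptom reported above.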
On 16/Nov/ 2011, at 20:17 , Rafael Pappert wrote:
Hello List,
is it possible to follow http 301
Thanks for updating the list.
On 11/17/2011 02:52 PM, Rafael Pappert wrote:
Hi,
After some investigation I found the problem.
I had db.ignore.external.links set to true, which is why
the fetcher wasn't following the redirection from domain.com to
www.domain.com.
Rafael.
On 16/Nov/ 2011, at 20:17 ,
That sounds creepy indeed. It would still need a similar amount of RAM plus
network overhead. Would a bloom filter be useful at all? It takes a lot less
space and I can live with a non-deterministic approach.
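
For readers unfamiliar with the structure mentioned here, a Bloom filter can be sketched as below. This is a generic, minimal version and not anything taken from Nutch. The class name, the default sizes, and the use of URLs as the stored items are assumptions, since the excerpt does not say exactly what would be stored. Membership tests can return false positives but never false negatives, which is the trade-off accepted above in exchange for the smaller memory footprint.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative sketch only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes: 1 << 20 bits is 128 KiB.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            # Derive independent-looking hashes by prefixing a one-byte salt.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

After `seen = BloomFilter()` and `seen.add(url)`, the check `url in seen` is always true for an added URL, while a URL that was never added is only very likely, not certain, to be reported as absent.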
On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
Hi
I think some
Actually, some key-value stores use Bloom filters for a similar purpose.
What is your queue size? And what is the redirect rate?
If most redirects are not cross-domain and the average number of URLs per
domain is not very big, a fixed-size cache in FetchItemQueue may
help. But this leads to lots of changes