Re: http.redirect.max

2012-03-02 Thread Lewis John Mcgibbney
Hi Alex,

Can you please have a look at NUTCH-1042?

Might it be the case that your redirect has a crawl-delay which then
falls into the boundary case we witness in the issue above?

You may want to change your log properties to DEBUG for a while and run
some small crawls on your problem URLs; maybe try adding some LOG.debug
statements to see what kind of conditions are being satisfied around the
fetcher areas mentioned in NUTCH-1042.
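For example, something like this in conf/log4j.properties (assuming the
stock layout with its cmdstdout appender; adjust the path and appender
name to your setup):

  # DEBUG logging for the fetcher and the http protocol plugins only
  log4j.logger.org.apache.nutch.fetcher=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.protocol.http.api=DEBUG,cmdstdout

That surfaces the redirect decisions in the fetch logs without turning
DEBUG on for everything.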

hth

On Thu, Mar 1, 2012 at 8:09 PM, alx...@aim.com wrote:


  Hello,

 I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still
 postpones redirected URLs to later depths.
 What is the correct config setting to have Nutch crawl redirected URLs
 immediately? I need it because I have a restriction that depth be at most 2.

 Thanks.
 Alex.





 -Original Message-
 From: xuyuanme xuyua...@gmail.com
 To: user user@nutch.apache.org
 Sent: Fri, Feb 24, 2012 1:31 am
 Subject: Re: http.redirect.max


 The config file is used for some proof-of-concept testing, so the content
 might be confusing; please ignore the incorrect parts.

 Yes, from my end I can see that the crawl for http://www.scotland.gov.uk
 is redirected as expected.

 However the website I tried to crawl is a bit more tricky.

 Here's what I want to do:

 1. Set

 http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B
 as the seed page

 2. And try to crawl one of the links
 (
 http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.OverviewDrugName=BACIGUENT
 )
 as a test

 If you click the link, you'll find the website uses redirects and cookies
 to control page navigation. So I used the protocol-httpclient plugin
 instead of protocol-http to handle the cookies.

 However, the redirect does not happen as expected. The only way I can fetch
 the second link is to manually change the response = getResponse(u, datum,
 false) call to response = getResponse(u, datum, true) in the
 org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
 lib-http plugin.

 So my issue is related to this specific site

 http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B


 lewis john mcgibbney wrote
 
  I've checked working with redirects and everything seems to work fine for
  me.
 
  The site I checked on
 
  http://www.scotland.gov.uk
 
  temp redirect to
 
  http://home.scotland.gov.uk/home
 
  Nutch gets this fine when I do some tweaking with nutch-site.xml
 
  redirects property -1 (just to demonstrate, I would usually not set it
 so)
 
  Lewis
 






-- 
*Lewis*


Re: http.redirect.max

2012-03-01 Thread alxsss

 Hello,

I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still
postpones redirected URLs to later depths.
What is the correct config setting to have Nutch crawl redirected URLs
immediately? I need it because I have a restriction that depth be at most 2.

Thanks.
Alex.

 

 

-Original Message-
From: xuyuanme xuyua...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max


The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl for http://www.scotland.gov.uk
is redirected as expected.

However the website I tried to crawl is a bit more tricky.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B
as the seed page

2. And try to crawl one of the links
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.OverviewDrugName=BACIGUENT)
as a test

If you click the link, you'll find the website uses redirects and cookies
to control page navigation. So I used the protocol-httpclient plugin
instead of protocol-http to handle the cookies.

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the response = getResponse(u, datum,
false) call to response = getResponse(u, datum, true) in the
org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B


lewis john mcgibbney wrote
 
 I've checked working with redirects and everything seems to work fine for
 me.
 
 The site I checked on
 
 http://www.scotland.gov.uk
 
 temp redirect to
 
 http://home.scotland.gov.uk/home
 
 Nutch gets this fine when I do some tweaking with nutch-site.xml
 
 redirects property -1 (just to demonstrate, I would usually not set it so)
 
 Lewis
 


 


Re: http.redirect.max

2012-02-24 Thread xuyuanme
The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl for http://www.scotland.gov.uk
is redirected as expected.

However the website I tried to crawl is a bit more tricky.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B
as the seed page

2. And try to crawl one of the links
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.OverviewDrugName=BACIGUENT)
as a test

If you click the link, you'll find the website uses redirects and cookies
to control page navigation. So I used the protocol-httpclient plugin
instead of protocol-http to handle the cookies.

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the response = getResponse(u, datum,
false) call to response = getResponse(u, datum, true) in the
org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B


lewis john mcgibbney wrote
 
 I've checked working with redirects and everything seems to work fine for
 me.
 
 The site I checked on
 
 http://www.scotland.gov.uk
 
 temp redirect to
 
 http://home.scotland.gov.uk/home
 
 Nutch gets this fine when I do some tweaking with nutch-site.xml
 
 redirects property -1 (just to demonstrate, I would usually not set it so)
 
 Lewis
 



Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
Hi,

Can you post your nutch-site.xml and I will give it a spin.

Thank you

Lewis

On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote:

 Just checked the latest code in 1.4 but it's the same. See line 138 at the
 link below:


 http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup


 The method just calls getResponse() and sets the followRedirects parameter
 to false.

 So I guess the http.redirect.max setting is not working on it?


 remi tassing wrote
 
  Would you give Nutch-1.4 a try? Maybe this bug is already solved?
 
  Remi
 
  On Thursday, February 23, 2012, xuyuanme <xuyuanme@...> wrote:
  Thanks for the information. But I found the wiki page
  http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
  content about Nutch redirects.
 
  I found that even if I set http.redirect.max=2 and
  db.ignore.external.links=false, the crawler still can't fetch redirected
  pages. And with further digging, I found the plugin lib-http (in Nutch
  1.1) contains the following code:
 
  Java file: org.apache.nutch.protocol.http.api.HttpBase
 
   public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
     ..
     response = getResponse(u, datum, false); // make a request
     ..
   }

   protected abstract Response getResponse(URL url,
                                           CrawlDatum datum,
                                           boolean followRedirects)
     throws ProtocolException, IOException;
 
  After I changed the call to getResponse(u, datum, true) and recompiled
  the plugin, the crawler fetches redirected pages as expected.
 
  So is this a bug in the lib-http library, or do I have a misunderstanding
  of how redirects work?
 





-- 
*Lewis*


Re: http.redirect.max

2012-02-23 Thread xuyuanme
Thanks! The config file can be downloaded here:
http://dl.dropbox.com/u/6614015/temp/config.zip
 

lewis john mcgibbney wrote
 
 Hi,
 
 Can you post your nutch-site.xml and I will give it a spin.
 
 Thank you
 
 Lewis
 
 On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme <xuyuanme@...> wrote:
 
 Just checked the latest code in 1.4 but it's the same. See line 138 at the
 link below:


 http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup


 The method just calls getResponse() and sets the followRedirects parameter
 to false.

 So I guess the http.redirect.max setting is not working on it?


 



Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
I've checked working with redirects and everything seems to work fine for
me.

The site I checked on

http://www.scotland.gov.uk

temp redirect to

http://home.scotland.gov.uk/home

Nutch gets this fine when I do some tweaking in nutch-site.xml:

http.redirect.max set to -1 (just to demonstrate; I would usually not set it so)
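In nutch-site.xml that override looks like this (the description
paraphrases the one in nutch-default.xml; value shown only for the demo):

  <property>
    <name>http.redirect.max</name>
    <value>-1</value>
    <description>The maximum number of redirects the fetcher will
    follow when fetching a page. If 0 or negative, redirect targets
    are not followed immediately but recorded for a later fetch
    round.</description>
  </property>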

Lewis

On Thu, Feb 23, 2012 at 3:18 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Additionally in your nutch-site.xml we don't maintain any query-(plugins),
 and there is no parse-text plugin either.


 On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 OK, for starters we don't use crawl-urlfilter.txt anymore; it has been
 deprecated as of Nutch 1.2, IIRC.

 Secondly, what are you trying to achieve here? Your url filter includes
 +^http://www
 \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_BrowseDrugInitial=B$
 +^http://www
 \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.OverviewDrugName=BACIGUENT$

 Your seed urls are also not exactly what I would expect for a seed list.

 One last thing: your fetcher.threads.per.host is pretty aggressive; I
 wouldn't personally set it this high unless it was my own server I was
 communicating with.

 So what exactly is it that you are having problems with?

 Lewis




 On Thu, Feb 23, 2012 at 12:11 PM, xuyuanme xuyua...@gmail.com wrote:

 Thanks! The config file can be downloaded here:
 http://dl.dropbox.com/u/6614015/temp/config.zip


 lewis john mcgibbney wrote
 
  Hi,
 
  Can you post your nutch-site.xml and I will give it a spin.
 
  Thank you
 
  Lewis
 
  On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme <xuyuanme@...> wrote:
 
  Just checked the latest code in 1.4 but it's the same. See line 138 at
  the link below:
 
 
 
 http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
 
 
 
  The method just calls getResponse() and sets the followRedirects
  parameter to false.
 
  So I guess the http.redirect.max setting is not working on it?
 
 
 





 --
 *Lewis*




 --
 *Lewis*




-- 
*Lewis*


Re: http.redirect.max

2012-02-22 Thread xuyuanme
Thanks for the information. But I found the wiki page
http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
content about Nutch redirects.

I found that even if I set http.redirect.max=2 and
db.ignore.external.links=false, the crawler still can't fetch redirected
pages. And with further digging, I found the plugin lib-http (in Nutch 1.1)
contains the following code:

Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ..
    response = getResponse(u, datum, false); // make a request
    ..
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled
the plugin, the crawler fetches redirected pages as expected.

So is this a bug in the lib-http library, or do I have a misunderstanding
of how redirects work?

Thanks!

lewis john mcgibbney wrote
 
 Hi Rafael,
 
 The page we are talking about will be added on the link below.
 
 http://wiki.apache.org/nutch/InternalDocumentation
 
 and will be available here
 
 http://wiki.apache.org/nutch/RedirectHandling
 
 




Re: http.redirect.max

2012-02-22 Thread remi tassing
Would you give Nutch-1.4 a try? Maybe this bug is already solved?

Remi

On Thursday, February 23, 2012, xuyuanme xuyua...@gmail.com wrote:
 Thanks for the information. But I found the wiki page
 http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
 content about Nutch redirects.

 I found that even if I set http.redirect.max=2 and
 db.ignore.external.links=false, the crawler still can't fetch redirected
 pages. And with further digging, I found the plugin lib-http (in Nutch
 1.1) contains the following code:

 Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ..
    response = getResponse(u, datum, false); // make a request
    ..
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

 After I changed the call to getResponse(u, datum, true) and recompiled
 the plugin, the crawler fetches redirected pages as expected.

 So is this a bug in the lib-http library, or do I have a misunderstanding
 of how redirects work?

 Thanks!

 lewis john mcgibbney wrote

 Hi Rafael,

 The page we are talking about will be added on the link below.

 http://wiki.apache.org/nutch/InternalDocumentation

 and will be available here

 http://wiki.apache.org/nutch/RedirectHandling







Re: http.redirect.max

2012-02-22 Thread xuyuanme
Just checked the latest code in 1.4 but it's the same. See line 138 at the
link below:

http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
 

The method just calls getResponse() and sets the followRedirects parameter
to false.

So I guess the http.redirect.max setting is not working on it?
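Or maybe the setting is consumed somewhere other than lib-http: the
Fetcher seems to check http.redirect.max after the protocol layer reports
a MOVED status. A rough, self-contained sketch of that decision (the class
and method names below are illustrative, not the actual Nutch code):

  // Illustrative condensation of the Fetcher-side redirect decision.
  // maxRedirect mirrors http.redirect.max; redirectCount is how many
  // redirects the current fetch item has already followed.
  public class RedirectSketch {
    enum Decision { FOLLOW_NOW, DEFER_TO_LATER_ROUND }

    static Decision decide(int maxRedirect, int redirectCount) {
      if (maxRedirect > 0 && redirectCount < maxRedirect) {
        return Decision.FOLLOW_NOW; // re-queue the target in this round
      }
      return Decision.DEFER_TO_LATER_ROUND; // 0 or negative: defer
    }

    public static void main(String[] args) {
      System.out.println(decide(2, 0));  // FOLLOW_NOW
      System.out.println(decide(-1, 0)); // DEFER_TO_LATER_ROUND
    }
  }

If that reading is right, getResponse(..., false) is intentional: the
protocol plugin only reports the redirect, and the Fetcher decides whether
to follow it now (http.redirect.max > 0) or record the target for a later
round (0 or negative).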


remi tassing wrote
 
 Would you give Nutch-1.4 a try? Maybe this bug is already solved?
 
 Remi
 
 On Thursday, February 23, 2012, xuyuanme <xuyuanme@...> wrote:
 Thanks for the information. But I found the wiki page
 http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
 content about Nutch redirects.

 I found that even if I set http.redirect.max=2 and
 db.ignore.external.links=false, the crawler still can't fetch redirected
 pages. And with further digging, I found the plugin lib-http (in Nutch
 1.1) contains the following code:

 Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ..
    response = getResponse(u, datum, false); // make a request
    ..
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

 After I changed the call to getResponse(u, datum, true) and recompiled
 the plugin, the crawler fetches redirected pages as expected.

 So is this a bug in the lib-http library, or do I have a misunderstanding
 of how redirects work?
 



Re: http.redirect.max

2011-11-21 Thread Lewis John Mcgibbney
Hi Rafael,

The page we are talking about will be added on the link below.

http://wiki.apache.org/nutch/InternalDocumentation

and will be available here

http://wiki.apache.org/nutch/RedirectHandling


 I guess the poor documentation of Nutch/Hadoop is the biggest problem for
 beginners like me. I started with Nutch ~4-6 months ago (not full time,
 but several hours every week). At first I wrote some plugins
 (parser/indexer). This was a bit tricky because I had to learn directly
 from the source, since most of the tutorials/documents were outdated (1.0)
 or simply wrong.


Please note we are trying to remove as much duplicated documentation
regarding Nutch and Hadoop as possible. The Nutch wiki has been updated
recently and this is ongoing work, so hopefully we can improve this more in
the near future. As Nutch focuses purely on web crawling, the Hadoop
material can be viewed directly in the Hadoop wiki. I've added a link to
this on our wiki's Nutch Hadoop Tutorial.


 My crawler is now running and I need to scale it up. The current version
 runs in local mode but that's not really fast. So I started to set up a
 Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am
 today, and my current questions are:

 - I will buy some new hardware for the Hadoop cluster, but I'm not sure
 about the configuration. Is Nutch I/O- or CPU-heavy?


On a brand new hardware configuration I have not heard of anyone blowing
gaskets or anything similar. If there is something wrong, it can usually be
fixed by improving the configuration.



 - what is the difference between protocol-httpclient and protocol-http?
 Just SSL and authentication? What about performance?


protocol-httpclient is broken; please see the JIRA issue that has been
filed. You will also need to have a look at the code for this, as I am by
no means an expert on the protocol-httpclient material.


 - what is a good value for the following configuration parameter:
- fetcher.threads.fetch
- fetcher.threads.per.queue
- mapred.tasktracker.map.tasks.maximum
- mapred.tasktracker.reduce.tasks.maximum
- mapred.map.tasks
- mapred.reduce.tasks


Impossible to say; this varies significantly with the crawl, the network,
the nature of the crawl data, etc. You simply need to experiment and read
as much existing documentation as possible. Sorry about this one.


 My current hardware is a 4-node cluster of dual-CPU (quad-core Xeon)
 machines, 32GB RAM, 2x2TB SATA HDD.
 I know it's impossible to define the always-right value. But a rule of
 thumb, to use as a starting value, would be a great thing and would save
 me a lot of trial-and-error investigation.


Unfortunately, this is open source software you are using. Maybe Cloudera
or some of the other commercially motivated experts can help you with this
stuff. This is outwith my experience. Try here:
http://wiki.apache.org/nutch/Support


 - what's the difference between fetcher.threads.fetch from the
 configuration and the -threads option of the crawl command?

This depends on how you wish to monitor/schedule your Nutch crawls. As you
know, running individual commands gives you more flexibility/control over
how Nutch does the work for you.


 - is it possible to follow external links only on 301 redirects?

Not got a clue, but I will definitely include this type of material in the
wiki page I created above. Maybe you can do a bit of investigation and help
me out when I get round to writing up on this stuff.



 - what is happening if a page is marked as db_redir_temp / db_redir_perm?
   Refetch after db.fetch.interval.default?

Again, we will need to work together to get our heads around this; if you
have a look at the code then maybe we can get something written up in due
course.

Sorry about the vague answers, however it's a pretty large task to answer
everything fully considering there are ~5-10 questions all in. I'm sure
there must be some material in the user@ archives, so please have a look
there as well.

hth

Lewis


Re: http.redirect.max

2011-11-18 Thread Rafael Pappert
Hi Alex,

this is not really a bug; it's an undocumented feature.
db.ignore.external.links prevents the fetcher from breaking
out of your set of domains, and this is what you need if you
don't want to crawl the whole web.
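The property itself looks like this in nutch-site.xml (value shown for
this use case; the description is paraphrased):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks pointing outside the source
    host/domain are ignored; as noted above, this also catches
    redirect targets on other hosts.</description>
  </property>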

Best regards,
Rafael.


On 17/Nov/ 2011, at 23:05 , alx...@aim.com wrote:

 
 Hi,
 
 Is this issue resolved in https://issues.apache.org/jira/browse/NUTCH-1044
 for the case when db.ignore.external.links is set to true?
 
 Thanks.
 Alex.
 
 
 
 
 
 
 -Original Message-
 From: Ferdy Galema ferdy.gal...@kalooga.com
 To: user user@nutch.apache.org
 Sent: Thu, Nov 17, 2011 6:01 am
 Subject: Re: http.redirect.max
 
 
 Thanks for updating the list.
 
 On 11/17/2011 02:52 PM, Rafael Pappert wrote:
 Hi,
 
 After some investigation I found the problem: I had
 db.ignore.external.links set to true; this is why the fetcher wasn't
 following the redirection from domain.com to www.domain.com.
 
 Rafael.
 
 
 
 On 16/Nov/ 2011, at 20:17 , Rafael Pappert wrote:
 
 Hello List,
 
 is it possible to follow http 301 redirects immediately?
 
 I tried to set http.redirect.max to 3 but the page is
 still not indexed. readdb still shows 1 page as
 unfetched / db_redir_perm, and I can't find the
 redirection target in the crawldb.
 
 How does nutch handle redirects?
 
 Thanks in advance,
 Rafael.
 
 
 
 
 
 



Re: http.redirect.max

2011-11-18 Thread Rafael Pappert
Hi Lewis,
 
 The honest truth is that there needs to be comprehensive documentation on
 the wiki for the way that Nutch handles redirects. This is a question that
 has gone fully unanswered for some time.

That's true.

  In the meantime, can you advise if there is anything over
 and above the files in nutch-default.xml and the o.a.n.protocol package
 which you would like to see documented?

I guess the poor documentation of Nutch/Hadoop is the biggest problem for
beginners like me. I started with Nutch ~4-6 months ago (not full time, but
several hours every week). At first I wrote some plugins (parser/indexer).
This was a bit tricky because I had to learn directly from the source,
since most of the tutorials/documents were outdated (1.0) or simply wrong.

My crawler is now running and I need to scale it up. The current version
runs in local mode but that's not really fast. So I started to set up a
Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am
today, and my current questions are:

- I will buy some new hardware for the Hadoop cluster, but I'm not sure
about the configuration. Is Nutch I/O- or CPU-heavy?

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

- what is the difference between protocol-httpclient and protocol-http?
Just SSL and authentication? What about performance?

- what is a good value for the following configuration parameter:
- fetcher.threads.fetch
- fetcher.threads.per.queue
- mapred.tasktracker.map.tasks.maximum
- mapred.tasktracker.reduce.tasks.maximum
- mapred.map.tasks
- mapred.reduce.tasks

My current hardware is a 4-node cluster of dual-CPU (quad-core Xeon)
machines, 32GB RAM, 2x2TB SATA HDD.
I know it's impossible to define the always-right value. But a rule of
thumb, to use as a starting value, would be a great thing and would save
me a lot of trial-and-error investigation.

- what's the difference between fetcher.threads.fetch from the
configuration and the -threads option of the crawl command?

- is it possible to follow external links only on 301 redirects?

- what is happening if a page is marked as db_redir_temp / db_redir_perm? 
Refetch after db.fetch.interval.default?


I found loads of tutorials and all of them have the same content, only the
very basics (how to do your first crawl). I guess comprehensive
documentation would be a big step for the amazing Nutch/Hadoop project.

Thanks in advance,
Rafael.


 
 Thanks
 
 On Wed, Nov 16, 2011 at 7:17 PM, Rafael Pappert r...@fwpsystems.com wrote:
 
 Hello List,
 
 is it possible to follow http 301 redirects immediately?
 
 I tried to set http.redirect.max to 3 but the page is
 still not indexed. readdb still shows 1 page as
 unfetched / db_redir_perm, and I can't find the
 redirection target in the crawldb.
 
 How does nutch handle redirects?
 
 Thanks in advance,
 Rafael.
 
 
 
 
 
 
 
 -- 
 *Lewis*



Re: http.redirect.max

2011-11-17 Thread Rafael Pappert
Hi,

After some investigation I found the problem: I had
db.ignore.external.links set to true; this is why the fetcher wasn't
following the redirection from domain.com to www.domain.com.

Rafael.



On 16/Nov/ 2011, at 20:17 , Rafael Pappert wrote:

 Hello List,
 
 is it possible to follow http 301 redirects immediately?
 
 I tried to set http.redirect.max to 3 but the page is
 still not indexed. readdb still shows 1 page as
 unfetched / db_redir_perm, and I can't find the
 redirection target in the crawldb.
 
 How does nutch handle redirects?
 
 Thanks in advance,
 Rafael.
 
 
 
 



Re: http.redirect.max

2011-11-17 Thread Ferdy Galema

Thanks for updating the list.

On 11/17/2011 02:52 PM, Rafael Pappert wrote:

Hi,

After some investigation I found the problem: I had
db.ignore.external.links set to true; this is why the fetcher wasn't
following the redirection from domain.com to www.domain.com.

Rafael.



On 16/Nov/ 2011, at 20:17 , Rafael Pappert wrote:


Hello List,

is it possible to follow http 301 redirects immediately?

I tried to set http.redirect.max to 3 but the page is
still not indexed. readdb still shows 1 page as
unfetched / db_redir_perm, and I can't find the
redirection target in the crawldb.

How does nutch handle redirects?

Thanks in advance,
Rafael.






Re: http.redirect.max and duplicate fetch/parse

2011-10-18 Thread Markus Jelsma
That sounds creepy indeed. It would still need a similar amount of RAM plus
network overhead. Would a bloom filter be useful at all? It takes a lot
less space, and I can live with a non-deterministic approach.
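For illustration, a minimal sketch of what I have in mind, using Guava's
BloomFilter (the class name SeenUrls and the sizing numbers are made up
for the example, not anything in Nutch):

  import com.google.common.hash.BloomFilter;
  import com.google.common.hash.Funnels;
  import java.nio.charset.StandardCharsets;

  public class SeenUrls {
    // ~10M expected URLs at a 1% false-positive rate costs on the order
    // of 10 MB, far less than an exact map of fetched records.
    private final BloomFilter<String> seen =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                           10000000, 0.01);

    // Returns true the first time a URL is offered; a false positive
    // means an occasional URL is skipped, but none is fetched twice.
    public synchronized boolean markIfUnseen(String url) {
      if (seen.mightContain(url)) {
        return false; // probably fetched already
      }
      seen.put(url);
      return true;
    }
  }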

On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
 Hi
 
 I think some external key-value storage could replace the map. It would be
 fast enough and the overhead would be insignificant (for many threads).
 But this is a very creepy solution.
 
 Sergey Volkov.
 
 On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
  Anyone?
  
  Hi,
  
  With a > 0 value for http.redirect.max there's a possibility of
  fetching and parsing duplicates; this is especially true for fetch
  lists with many domains, even with just a few (+10) records per
  domain/host queue.
  
  Assuming there's only one thread per queue, how can we use
  http.redirect.max and prevent fetching and parsing duplicates?
  
  I'm not a big fan of keeping a map of fetched records in memory as it'll
  blow up the heap. We can also not safely remove a record from the fetch
  queue as the queue feeder may not have finished and duplicates may still
  enter a queue.
  
  Any thoughts?
  
  Thanks,
  Markus

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: http.redirect.max and duplicate fetch/parse

2011-10-18 Thread Sergey A Volkov

Actually, some KV storages use a bloom filter for a similar purpose.

What is your queue size? And what is the redirect rate?

If most redirects are not cross-domain and the average number of URLs per
domain is not very big, some fixed-size cache in FetchItemQueue may help.
But this leads to lots of changes in the fetcher.
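For example, a bounded per-queue cache could be as small as this
(illustrative only; RecentUrlCache is not the actual FetchItemQueue code):

  import java.util.Collections;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Set;

  // A fixed-size LRU set of recently seen URLs for one fetch queue.
  // LinkedHashMap in access order evicts the eldest entry at capacity.
  public class RecentUrlCache {
    private final Set<String> recent;

    public RecentUrlCache(final int capacity) {
      this.recent = Collections.newSetFromMap(
          new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
              return size() > capacity; // evict least-recently-used URL
            }
          });
    }

    // true if the URL was not seen recently (and is now recorded)
    public synchronized boolean markIfUnseen(String url) {
      return recent.add(url);
    }
  }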


On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:

That sounds creepy indeed. It would still need a similar amount of RAM plus
network overhead. Would a bloom filter be useful at all? It takes a lot
less space, and I can live with a non-deterministic approach.

On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:

Hi

I think some external key-value storage could replace the map. It would be
fast enough and the overhead would be insignificant (for many threads).
But this is a very creepy solution.

Sergey Volkov.

On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:

Anyone?


Hi,

With a > 0 value for http.redirect.max there's a possibility of
fetching and parsing duplicates; this is especially true for fetch
lists with many domains, even with just a few (+10) records per
domain/host queue.

Assuming there's only one thread per queue, how can we use
http.redirect.max and prevent fetching and parsing duplicates?

I'm not a big fan of keeping a map of fetched records in memory as it'll
blow up the heap. We can also not safely remove a record from the fetch
queue as the queue feeder may not have finished and duplicates may still
enter a queue.

Any thoughts?

Thanks,
Markus