Could not obtain block Error

2006-08-09 Thread Uygar Yüzsüren

Hi Everyone,

I am having problems getting summaries from the DFS segments. Sometimes, on
random segments and random DFS blocks, I encounter the following error:

java.io.IOException: Could not obtain block: blk_1996629287798238182
file=/data/crawl/segments/20060616121845/parse_text/part-00047/index
offset=0

Could this problem somehow be related to my hadoop-site configuration, or
could it be related to my nutch version (hadoop-0.4.0 and nutch-2006-07-19)?

When I try to get the parse_text of the URL with the segread script, there is
no problem, so I assume it is not related to particular URLs.
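
In case it helps to narrow the problem down, the kind of minimal check I run
is to read the file straight from DFS, bypassing Nutch entirely. This is only
a rough sketch against the Hadoop FileSystem API as I understand it (the path
is the example from above):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Tries to read the first bytes of the index file directly from DFS;
  // if the block itself is unavailable, the read should fail here too.
  public class DfsReadCheck {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path p = new Path("/data/crawl/segments/20060616121845/parse_text/part-00047/index");
      FSDataInputStream in = fs.open(p);
      byte[] buf = new byte[1024];
      int n = in.read(buf);  // a "Could not obtain block" error would surface here
      System.out.println("read " + n + " bytes OK");
      in.close();
    }
  }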


Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-09 Thread Uroš Gruber

Sami Siren wrote:


I set DEBUG level logging and checked the time during operations; the
MapReduce job that runs after every page takes 3-4 seconds until the next
URL is fetched.

I have a local site, and fetching 100 pages takes about 6 minutes.


You are fetching a single site, yes? Then you can get more performance by
tweaking the configuration of the fetcher:

<property>
 <name>fetcher.server.delay</name>
 <value></value>
 <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
 <name>fetcher.threads.per.host</name>
 <value></value>
 <description>This number is the maximum number of threads that
   should be allowed to access a host at one time.</description>
</property>
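
If I remember HttpBase correctly, those two values are read roughly like
this; note the delay is a float number of seconds converted to milliseconds,
so fractional values work too (a paraphrase from memory, not an exact copy
of the source):

  import org.apache.hadoop.conf.Configuration;

  // Rough paraphrase of how the http plugin picks up the two settings.
  public class FetcherConfExample {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      int maxThreadsPerHost = conf.getInt("fetcher.threads.per.host", 1);
      long serverDelayMs = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
      System.out.println(maxThreadsPerHost + " threads/host, " + serverDelayMs + " ms delay");
    }
  }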


Hi,

I've managed to test Nutch speed on several machines with different OSes
as well.

It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this.

When the fetcher threads were set to the default value, the fetcher was
doing MapReduce after every URL.

But now the job is run on about 400 URLs, or maybe more.

--
Uros

--
Sami Siren




Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-09 Thread e w

What do you now set fetcher.threads.per.host to? Can you tell me what your
generate.max.per.host value is as well?

I got big improvements after setting:

<property>
 <name>fetcher.server.delay</name>
 <value>0.5</value>
 <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

even though I'm only generating 5 URLs per host (generate.max.per.host=5).
I don't know whether fetcher.server.delay also affects requests made through
a proxy (anyone?), since I'm using a proxy.

Also, I still can't see any logging output from the fetchers, i.e. which
URL is being requested, in any log file anywhere. I'm not so hot with Java,
but can anyone here tell me whether

log4j.threshhold=ALL

in conf/log4j.properties should be 'threshold' with one h, or whether two
h's are the Java way?

And is there any reason why the lines in the function below are commented
out:

 public void configure(JobConf job) {
   setConf(job);

   this.segmentName = job.get(SEGMENT_NAME_KEY);
   this.storingContent = isStoringContent(job);
   this.parsing = isParsing(job);

   //if (job.getBoolean("fetcher.verbose", false)) {
   //  LOG.setLevel(Level.FINE);
   //}
 }

Is this parameter now read somewhere else?

Any enlightenment always appreciated.

-Ed

On 8/9/06, Uroš Gruber [EMAIL PROTECTED] wrote:


Sami Siren wrote:

 I set DEBUG level logging and checked the time during operations; the
 MapReduce job that runs after every page takes 3-4 seconds until the next
 URL is fetched.
 I have a local site, and fetching 100 pages takes about 6 minutes.

 You are fetching a single site, yes? Then you can get more performance
 by tweaking the configuration of the fetcher:

 <property>
  <name>fetcher.server.delay</name>
  <value></value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
 </property>

 <property>
  <name>fetcher.threads.per.host</name>
  <value></value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
 </property>

Hi,

I've managed to test Nutch speed on several machines with different OSes
as well.
It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this.

When the fetcher threads were set to the default value, the fetcher was
doing MapReduce after every URL.
But now the job is run on about 400 URLs, or maybe more.

--
Uros
 --
 Sami Siren




Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-09 Thread Uroš Gruber

e w wrote:
What do you now set fetcher.threads.per.host to? Can you tell me what your
generate.max.per.host value is as well?


<property>
 <name>fetcher.server.delay</name>
 <value>0</value>
 <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
 <name>fetcher.threads.fetch</name>
 <value>10</value>
</property>

<property>
 <name>generate.max.per.host</name>
 <value>400</value>
</property>

<property>
 <name>fetcher.threads.per.host</name>
 <value>10</value>
</property>

<property>
 <name>http.max.delays</name>
 <value>30</value>
</property>


I got big improvements after setting:

<property>
 <name>fetcher.server.delay</name>
 <value>0.5</value>
 <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

even though I'm only generating 5 URLs per host (generate.max.per.host=5).
I don't know whether fetcher.server.delay also affects requests made
through a proxy (anyone?), since I'm using a proxy.

Also, I still can't see any logging output from the fetchers, i.e. which
URL is being requested, in any log file anywhere. I'm not so hot with Java,
but can anyone here tell me whether

log4j.threshhold=ALL


I set this:

log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG

so that I can see what is going on.

--
Uros
in conf/log4j.properties should be 'threshold' with one h, or whether two
h's are the Java way?

And is there any reason why the lines in the function below are commented
out:

 public void configure(JobConf job) {
   setConf(job);

   this.segmentName = job.get(SEGMENT_NAME_KEY);
   this.storingContent = isStoringContent(job);
   this.parsing = isParsing(job);

   //if (job.getBoolean("fetcher.verbose", false)) {
   //  LOG.setLevel(Level.FINE);
   //}
 }

Is this parameter now read somewhere else?

Any enlightenment always appreciated.

-Ed

On 8/9/06, Uroš Gruber [EMAIL PROTECTED] wrote:


Sami Siren wrote:

 I set DEBUG level logging and checked the time during operations; the
 MapReduce job that runs after every page takes 3-4 seconds until the next
 URL is fetched.
 I have a local site, and fetching 100 pages takes about 6 minutes.

 You are fetching a single site, yes? Then you can get more performance
 by tweaking the configuration of the fetcher:

 <property>
  <name>fetcher.server.delay</name>
  <value></value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
 </property>

 <property>
  <name>fetcher.threads.per.host</name>
  <value></value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
 </property>

Hi,

I've managed to test Nutch speed on several machines with different OSes
as well.
It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this.

When the fetcher threads were set to the default value, the fetcher was
doing MapReduce after every URL.
But now the job is run on about 400 URLs, or maybe more.

--
Uros
 --
 Sami Siren








[jira] Created: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-09 Thread Renaud Richardet (JIRA)
Improve readability of logs/hadoop.log
--

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor


Adding

log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN

to conf/log4j.properties dramatically improves the readability of the logs
in logs/hadoop.log (it removes all the PluginRepository INFO messages).





[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-08-09 Thread Jason Calabrese (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Jason Calabrese updated NUTCH-344:
--

Attachment: HttpBase.patch

This fix missed one little change, which caused BLOCKED_ADDR_TO_TIME and
BLOCKED_ADDR_QUEUE to get out of sync.

To fix the problem you only need to change the remove on line 385 to:
BLOCKED_ADDR_QUEUE.remove(i);

I can report that the fetch is now much faster with both of these fixes.
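
For reference, with both fixes applied the loop looks roughly like this (a
sketch of the intent, not the exact committed code; names follow the snippet
quoted below):

  // Iterate the block queue once by index instead of busy-waiting on
  // getLast()/removeLast(), and remove expired hosts from BOTH structures
  // so the map and the queue stay in sync.
  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      for (int i = BLOCKED_ADDR_QUEUE.size() - 1; i >= 0; i--) {
        String host = (String) BLOCKED_ADDR_QUEUE.get(i);
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.remove(i);  // the one-line fix: remove by index
        }
      }
    }
  }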

 Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
 -

 Key: NUTCH-344
 URL: http://issues.apache.org/jira/browse/NUTCH-344
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 0.8
 Environment: All
Reporter: Greg Kim
 Fix For: 0.8.1, 0.9.0

 Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch


 The recent change to the following code in HttpBase.java has a tendency
 to block fetcher threads while one thread busy-waits...
   private static void cleanExpiredServerBlocks() {
     synchronized (BLOCKED_ADDR_TO_TIME) {
       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <== LINE 3
         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
         if (time <= System.currentTimeMillis()) {
           BLOCKED_ADDR_TO_TIME.remove(host);
           BLOCKED_ADDR_QUEUE.removeLast();
         }
       }
     }
   }
 LINE 3: As long as there are *any* entries in BLOCKED_ADDR_QUEUE, the
 thread that first enters this block busy-waits until the queue becomes
 empty, while all other threads block on the synchronized block. This leads
 to extremely poor fetcher performance.
 Since the checkin to respect crawlDelay in robots.txt, we are no longer
 guaranteed that BLOCKED_ADDR_QUEUE is a FIFO list. The simple fix is to
 iterate the queue once rather than busy-waiting...





[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-08-09 Thread Jacob Brunson (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427096 ] 

Jacob Brunson commented on NUTCH-344:
-

I'm having problems with the patch committed in revision #429779. I used to
get the fetch aborted with X hung threads problem. After updating to this
revision, fetching goes fine for a while, but then I get this error on just
about every page fetch attempt:
2006-08-09 23:27:28,548 INFO  fetcher.Fetcher - fetching 
http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at 
org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at 
org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO  fetcher.Fetcher - fetch of 
http://www.xmission.com/~nelsonb/resources.htm failed with: 
java.lang.NullPointerException







[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-08-09 Thread Greg Kim (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427100 ] 

Greg Kim commented on NUTCH-344:


I had the correct version in my workspace but botched the copy over to the
vendor trunk. Doh! Thanks, Jason, for catching it!

Jacob, your problem should be resolved with the one-line patch that Jason
provided.
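
For the record, that out-of-sync state is exactly what produced the
NullPointerException Jacob saw: a host could be removed from
BLOCKED_ADDR_TO_TIME while still sitting in BLOCKED_ADDR_QUEUE, so the next
pass looked it up and unboxed null. Illustratively (not the committed code):

  // Host is still queued but its entry is gone from the map:
  String host = (String) BLOCKED_ADDR_QUEUE.get(i);
  Long time = (Long) BLOCKED_ADDR_TO_TIME.get(host);  // returns null
  long t = time.longValue();  // NullPointerException (HttpBase.java:382)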

