Could not obtain block Error
Hi everyone, I have a problem getting summaries from the DFS segments. Sometimes, on random segments and random DFS blocks, I see the following error:

java.io.IOException: Could not obtain block: blk_1996629287798238182 file=/data/crawl/segments/20060616121845/parse_text/part-00047/index offset=0

Could this problem somehow be related to my hadoop-site configuration, or could it be related to my Nutch version (hadoop-0.4.0 and nutch-2006-07-19)? When I try to get the parse_text of the URL with the segread script, there is no problem, so I assume it is not related to particular URLs.
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Sami Siren wrote:
>> I set DEBUG level logging and I've checked the time during operations. The MapReduce job which is run after every page takes 3-4 seconds until the next URL is fetched. I have some local site, and fetching 100 pages takes about 6 minutes.
>
> You are fetching a single site, yes? Then you can get more performance by tweaking the configuration of the fetcher:
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value></value>
>   <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.host</name>
>   <value></value>
>   <description>This number is the maximum number of threads that should be allowed to access a host at one time.</description>
> </property>
>
> -- Sami Siren

Hi,

I've managed to test Nutch speed on several machines with different OSes as well. It looks like fetcher.threads.per.host makes the fetcher run faster. What I still don't understand is this: when fetcher threads was set to the default value, the fetcher was doing a MapReduce after every URL, but now the job is run on about 400 URLs or maybe more.

-- Uros
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
What do you now set fetcher.threads.per.host to? Can you tell me what your generate.max.per.host value is as well?

I got big improvements after setting:

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

even though I'm only generating 5 URLs per host (generate.max.per.host=5). I don't know whether fetcher.server.delay also affects requests made through a proxy (anyone?), since I'm using a proxy.

Also, I still can't see any logging output from the fetchers, i.e. what URL is being requested, in any log file anywhere. I'm not so hot with Java, but can anyone here tell me whether

log4j.threshhold=ALL

in conf/log4j.properties should be threshold with one h, or are two h's the Java way? And is there any reason why the lines in the function below are commented out:

public void configure(JobConf job) {
  setConf(job);
  this.segmentName = job.get(SEGMENT_NAME_KEY);
  this.storingContent = isStoringContent(job);
  this.parsing = isParsing(job);
  //if (job.getBoolean("fetcher.verbose", false)) {
  //  LOG.setLevel(Level.FINE);
  //}
}

Is this parameter now read somewhere else? Any enlightenment always appreciated.

-Ed

On 8/9/06, Uroš Gruber [EMAIL PROTECTED] wrote:
> [...]
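For readers following along: in Nutch 0.8-era releases, settings like these go in an overlay file, conf/nutch-site.xml, which overrides nutch-default.xml. A minimal sketch for a polite single-site crawl might look like this (the values below are illustrative assumptions, not tuned recommendations):

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml.
     Values are illustrative only. -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
    <description>Seconds the fetcher waits between successive requests to the same server.</description>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
    <description>Maximum number of threads allowed to access one host at a time.</description>
  </property>
</configuration>
```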
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
e w wrote:
> What do you now set fetcher.threads.per.host to? Can you tell me what your generate.max.per.host value is as well?

<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>400</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>

<property>
  <name>http.max.delays</name>
  <value>30</value>
</property>

> I got big improvements after setting:
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>0.5</value>
>   <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
> </property>
>
> even though I'm only generating 5 URLs per host (generate.max.per.host=5). I don't know whether fetcher.server.delay also affects requests made through a proxy (anyone?), since I'm using a proxy. Also, I still can't see any logging output from the fetchers, i.e. what URL is being requested, in any log file anywhere. I'm not so hot with Java, but can anyone here tell me whether
>
> log4j.threshhold=ALL

I set this:

log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG

so that I can see what is going on.

-- Uros

> in conf/log4j.properties should be threshold with one h, or are two h's the Java way? And is there any reason why the lines in the function below are commented out: [...] Is this parameter now read somewhere else? Any enlightenment always appreciated.
>
> -Ed
>
> On 8/9/06, Uroš Gruber [EMAIL PROTECTED] wrote:
> > [...]
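On the spelling question raised in this thread: the English word is "threshold", and that is also the key log4j's PropertyConfigurator recognizes, log4j.threshold. A conf/log4j.properties fragment combining that with the DEBUG loggers mentioned above might look like the following (a sketch, assuming the stock Nutch logging setup):

```properties
# Global threshold: let all events through to the configured appenders.
log4j.threshold=ALL

# Per-package levels: verbose output from Nutch and Hadoop classes.
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
```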
[jira] Created: (NUTCH-346) Improve readability of logs/hadoop.log
Improve readability of logs/hadoop.log
--------------------------------------

Key: NUTCH-346
URL: http://issues.apache.org/jira/browse/NUTCH-346
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor

Adding

log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN

to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (removes all INFO).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Jason Calabrese updated NUTCH-344:
----------------------------------

Attachment: HttpBase.patch

This fix missed one little change that caused BLOCKED_ADDR_TO_TIME and BLOCKED_ADDR_QUEUE to get out of sync. To fix the problem you only need to change the remove on line 385 to:

BLOCKED_ADDR_QUEUE.remove(i);

I can report that the fetch is now much faster with both of these fixes.

> Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
> -------------------------------------------------------------------------
>
> Key: NUTCH-344
> URL: http://issues.apache.org/jira/browse/NUTCH-344
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8.1, 0.9.0, 0.8
> Environment: All
> Reporter: Greg Kim
> Fix For: 0.8.1, 0.9.0
> Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch
>
> The recent change to the following code in HttpBase.java has a tendency to block fetcher threads while one thread busy-waits...
>
> private static void cleanExpiredServerBlocks() {
>   synchronized (BLOCKED_ADDR_TO_TIME) {
>     while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <== LINE 3
>       String host = (String) BLOCKED_ADDR_QUEUE.getLast();
>       long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
>       if (time <= System.currentTimeMillis()) {
>         BLOCKED_ADDR_TO_TIME.remove(host);
>         BLOCKED_ADDR_QUEUE.removeLast();
>       }
>     }
>   }
> }
>
> LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the thread that first enters this block busy-waits until the queue becomes empty, while all other threads block on the synchronized block. This leads to extremely poor fetcher performance. Since the checkin to respect crawlDelay in robots.txt, we are no longer guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is to iterate the queue once rather than busy waiting...
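Putting Greg's iterate-once suggestion and Jason's remove(i) correction together, the repaired loop can be sketched in isolation like this. This is a standalone model with hypothetical scaffolding that mirrors the HttpBase field names, not the actual Nutch class:

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Simplified stand-in for the HttpBase bookkeeping, to illustrate the
// "iterate once instead of busy-waiting" fix discussed in NUTCH-344.
public class BlockedAddrCleaner {
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<String>();
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<String, Long>();

    static void cleanExpiredServerBlocks(long now) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            // Walk the queue once, back to front, instead of looping until empty.
            for (int i = BLOCKED_ADDR_QUEUE.size() - 1; i >= 0; i--) {
                String host = BLOCKED_ADDR_QUEUE.get(i);
                long time = BLOCKED_ADDR_TO_TIME.get(host).longValue();
                if (time <= now) {
                    // Remove from BOTH structures so they stay in sync
                    // (Jason's one-line fix: remove(i), not removeLast()).
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    BLOCKED_ADDR_QUEUE.remove(i);
                }
            }
        }
    }

    public static void main(String[] args) {
        BLOCKED_ADDR_QUEUE.add("a.example");
        BLOCKED_ADDR_TO_TIME.put("a.example", 100L);
        BLOCKED_ADDR_QUEUE.add("b.example");
        BLOCKED_ADDR_TO_TIME.put("b.example", 900L);
        cleanExpiredServerBlocks(500L);  // only a.example (expired at 100) is removed
        System.out.println(BLOCKED_ADDR_QUEUE);            // prints [b.example]
        System.out.println(BLOCKED_ADDR_TO_TIME.keySet()); // prints [b.example]
    }
}
```

The key difference from the busy-wait version: unexpired entries no longer pin the thread inside the loop; they are simply skipped, and the method returns after one pass.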
[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427096 ]

Jacob Brunson commented on NUTCH-344:
-------------------------------------

I'm having problems with the patch committed in revision #429779. I used to be having the "fetch aborted with X hung threads" problem. After updating to this revision, fetching goes fine for a while, but then I get this error on just about every page fetch attempt:

2006-08-09 23:27:28,548 INFO fetcher.Fetcher - fetching http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO fetcher.Fetcher - fetch of http://www.xmission.com/~nelsonb/resources.htm failed with: java.lang.NullPointerException
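For anyone puzzling over the stack trace above: the NPE at HttpBase.java:382 is what you would expect once the queue and the time map drift out of sync, because BLOCKED_ADDR_TO_TIME.get(host) returns null for a host that is still queued, and unboxing the null Long throws. A minimal standalone illustration (hypothetical names, not Nutch code):

```java
import java.util.HashMap;
import java.util.LinkedList;

public class OutOfSyncNpe {
    public static void main(String[] args) {
        LinkedList<String> queue = new LinkedList<String>();
        HashMap<String, Long> times = new HashMap<String, Long>();

        queue.add("host.example");  // host left in the queue...
        // ...but its entry is missing from the time map (out of sync).

        String host = queue.getLast();
        try {
            // get() returns null, so unboxing via longValue() throws NPE,
            // mirroring the failure in cleanExpiredServerBlocks.
            long time = times.get(host).longValue();
            System.out.println(time);
        } catch (NullPointerException e) {
            System.out.println("NPE: queue and time map are out of sync");
        }
    }
}
```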
[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427100 ]

Greg Kim commented on NUTCH-344:
--------------------------------

Had the correct version in my workspace; botched the copy over to the vendor trunk. Doh! Thanks, Jason, for catching it! Jacob, your problem should be resolved with the one-line patch that Jason provided.