Hi

The log output doesn't tell you what the task is actually doing; it only shows 
Hadoop output and the initialization of the URL filters. There shouldn't be a 
real problem with the parser job or the URL filter code in Nutch itself; we 
crawl large parts of the internet and the parser never stalls, at least not on 
URL filter processing. Anyway, check your CrawlDB: there may well be very long 
URLs choking the regexes.
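
If you want to confirm it's the filters rather than the parser itself, you can 
push a suspect URL through the configured filter chain by hand. A minimal 
sketch, assuming your Nutch build ships the URLFilterChecker tool and you run 
it from the Nutch home directory (the class name and the -allCombined flag may 
differ between versions):

  # Build an artificially long URL with thousands of distinct path segments,
  # then feed it through all configured URL filters. If this takes minutes,
  # the filter regexes (not the parser itself) are what's stalling.
  long_url="http://example.com/$(seq -s/ 1 3000)"
  time echo "$long_url" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined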

If your CrawlDB isn't too large you can dump it as CSV and grep for lines 
longer than, say, 250 or 500 characters. You could also keep a backup of your 
CrawlDB and limit the URL length with a filter rule. Re-parsing the bad segment 
with the same URL-length-limiting regex should solve the problem; see the 
sketch below.
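
A rough sketch of the dump-and-grep step and a length-limiting rule. The 
readdb paths, the csv format option, and the 500-character threshold are 
assumptions, so adapt them to your Nutch version and crawl layout:

  # Dump the CrawlDB to text (csv format, if your Nutch version supports it)
  bin/nutch readdb crawl/crawldb -dump crawldb-dump -format csv

  # List lines longer than 500 characters - likely candidates for choking
  # the regex filters
  awk 'length($0) > 500' crawldb-dump/part-* | head

  # Then, near the top of conf/regex-urlfilter.txt, reject overly long URLs
  # before any more expensive rule gets to see them:
  #   -^.{501,}$
  # and re-run the parse of the bad segment with this rule in place.

With the overly long URLs filtered out, the parse should finish in normal time.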

Cheers

-----Original message-----
> From:sidbatra <[email protected]>
> Sent: Mon 02-Jul-2012 20:52
> To: [email protected]
> Subject: Re: ParseSegment taking a long time to finish
> 
> I have a recent example here from the logs during Parsing:
> 
> 2012-06-30 23:46:55,763 INFO org.apache.hadoop.mapred.ReduceTask (main):
> Merging 0 segments, 0 bytes from memory into reduce
> 2012-06-30 23:46:55,763 INFO org.apache.hadoop.mapred.Merger (main): Merging
> 4 sorted segments
> 2012-06-30 23:46:55,766 INFO org.apache.hadoop.mapred.Merger (main): Down to
> the last merge-pass, with 4 segments left of total size: 960691756 bytes
> 2012-06-30 23:46:55,767 INFO org.apache.hadoop.conf.Configuration (main):
> found resource regex-urlfilter.txt at
> file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201206280129_0071/jars/regex-urlfilter.txt
> 2012-06-30 23:46:55,768 INFO org.apache.hadoop.conf.Configuration (main):
> found resource regex-normalize.xml at
> file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201206280129_0071/jars/regex-normalize.xml
> 2012-06-30 23:46:55,829 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer (main): can't
> find rules for scope 'outlink', using default
> 2012-07-01 04:42:24,802 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer (main): can't
> find rules for scope 'fetcher', using default
> 2012-07-01 08:48:03,952 INFO org.apache.hadoop.mapred.Task (main):
> Task:attempt_201206280129_0071_r_000002_0 is done. And is in the process of
> commiting
> 2012-07-01 08:48:12,698 INFO org.apache.hadoop.mapred.Task (main): Task
> 'attempt_201206280129_0071_r_000002_0' done.
> 2012-07-01 08:48:12,699 INFO org.apache.hadoop.mapred.TaskLogsTruncater
> (main): Initializing logs' truncater with mapRetainSize=-1 and
> reduceRetainSize=-1
> 
> It takes 4 hours after each of these messages:
> RegexURLNormalizer (main): can't find rules for scope 'outlink', using
> default
> RegexURLNormalizer (main): can't find rules for scope 'fetcher', using
> default
> 
> 
> There is a recommendation somewhere on the mailing list to reduce the number
> of inlinks and outlinks. 
> 
> What is odd is that this hanging issue isn't consistent with the number of
> links being parsed. Other instances of parsing finish in 1/5 the time.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-tp3758053p3992586.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
