You already have that rule configured?
Yes, its -^.{350,}$
Is it one of the first simple expressions you have?
This is an excellent point. It's not the first one. I'll move it to first
place and see if it helps.
How many records are you processing each time, is it roughly the same for
all segments?
It's roughly the same for all segments - 300,000 URLs per segment. Each
segment finishes in 2.5 hours but one segment took 10 hours.
And are you running on Hadoop or pseudo or local?
Hadoop on Amazon EC2, one machine m2.2xlarge with 34.20 GB RAM and 13 (4
cores x 3.25 units) compute units.
mapred.child.java.opts -Xmx4096m
mapred.tasktracker.map.tasks.maximum 6
mapred.tasktracker.reduce.tasks.maximum 2
--
View this message in context:
http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-tp3758053p3992605.html
Sent from the Nutch - User mailing list archive at Nabble.com.