Regex order matters. I'd be happy to hear the results.
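For context: in conf/regex-urlfilter.txt Nutch applies the rules top to bottom and the first matching rule decides accept or reject, so cheap anchored checks should sit above the expensive patterns. A minimal sketch of the intended ordering (the second rule and the final catch-all are the stock defaults; adapt to your own file):

  # Cheap length check first: rejects oversized URLs before any of
  # the costlier patterns below are even tried.
  -^.{350,}$

  # More expensive rule: skip URLs with a slash-delimited segment
  # that repeats 3+ times, to break loops.
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # Accept anything else.
  +.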
Considering your hardware, you should be able to parse that number of pages in under an hour. You should also decrease your mapper/reducer heap sizes significantly; this job doesn't need 4G of RAM. 1G for mappers and 500M for reducers is safe enough. You can then allocate more task slots and get higher throughput (see the configuration sketch after the quoted message below).

-----Original message-----
> From: sidbatra <[email protected]>
> Sent: Mon 02-Jul-2012 23:02
> To: [email protected]
> Subject: RE: ParseSegment taking a long time to finish
>
> You already have that rule configured?
>
> Yes, it's -^.{350,}$
>
> Is it one of the first simple expressions you have?
>
> This is an excellent point. It's not the first one. I'll move it to
> first place and see if it helps.
>
> How many records are you processing each time, is it roughly the same
> for all segments?
>
> It's roughly the same for all segments - 300,000 URLs per segment. Each
> segment finishes in 2.5 hours, but one segment took 10 hours.
>
> And are you running on Hadoop or pseudo or local?
>
> Hadoop on Amazon EC2, one m2.2xlarge machine with 34.20 GB RAM and 13
> compute units (4 cores x 3.25 units).
>
> mapred.child.java.opts                    -Xmx4096m
> mapred.tasktracker.map.tasks.maximum      6
> mapred.tasktracker.reduce.tasks.maximum   2
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-tp3758053p3992605.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
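Regarding the heap settings, here is a minimal mapred-site.xml sketch of the suggestion above, assuming a Hadoop 1.x style setup that supports the per-task-type mapred.{map,reduce}.child.java.opts overrides (otherwise just set mapred.child.java.opts to -Xmx1024m for both). The slot counts are illustrative, not a tuned recommendation:

  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>  <!-- 1G per map task instead of 4G -->
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512m</value>  <!-- ~500M per reduce task -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>  <!-- smaller heaps leave room for more map slots -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>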

