Regex order matters. I'd be glad to hear the results.
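
For instance, in conf/regex-urlfilter.txt the rules are applied top-down and
the first match wins, so the cheap length check should sit above the heavier
patterns. A sketch only (the second rule below is just Nutch's default
repeated-segment filter, used here as an example of a costlier pattern):

# reject overlong URLs before any expensive regexes run
-^.{350,}$

# skip URLs with slash-delimited segment that repeats 3+ times (costlier)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.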

Given your hardware, you should be able to parse that number of pages in under
an hour. You should also decrease your mapper/reducer heap sizes
significantly; parsing doesn't need 4G of RAM. 1G per mapper and 500M per
reducer is safe enough. You can then allocate more task slots and get higher
throughput.
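
Something like this in mapred-site.xml (a sketch only: the per-task
mapred.map.child.java.opts / mapred.reduce.child.java.opts names are the
Hadoop 1.x way to set map and reduce heaps separately, and the slot counts
are illustrative values for your 4-core m2.2xlarge):

<configuration>
  <!-- 1G heap per mapper instead of 4G -->
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <!-- 500M heap per reducer -->
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- with smaller heaps you can afford more concurrent task slots -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
</configuration>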

-----Original message-----
> From: sidbatra <[email protected]>
> Sent: Mon 02-Jul-2012 23:02
> To: [email protected]
> Subject: RE: ParseSegment taking a long time to finish
> 
> You already have that rule configured? 
> 
> Yes, it's    -^.{350,}$
> 
> Is it one of the first simple expressions you have? 
> 
> This is an excellent point. It's not the first one. I'll move it to first
> place and see if it helps.
> 
> How many records are you processing each time, is it roughly the same for
> all segments?
> 
> It's roughly the same for all segments - 300,000 URLs per segment. Each
> segment finishes in 2.5 hours, but one segment took 10 hours.
> 
>  And are you running on Hadoop or pseudo or local? 
> 
> Hadoop on Amazon EC2, one machine m2.2xlarge with 34.20 GB RAM and 13 (4
> cores x 3.25 units) compute units.
> 
> mapred.child.java.opts        -Xmx4096m
> mapred.tasktracker.map.tasks.maximum   6
> mapred.tasktracker.reduce.tasks.maximum 2
> 
