Julien,

You might be right, swap might be the bottleneck.
On the other hand, I am guessing this is because I am not taking advantage of map/reduce, since I am running in local single-node mode. Thus one mapper and one reducer, I assume? Might that be an issue? I am thinking of running it on a cluster.

Would you be able to give me some tips on the number of mappers and reducers I should set? Say, the number of mappers per GiB of memory per node? If I use a slave node with 8 GiB of memory, what would be a sensible number of mappers to start with? I have been browsing the wiki and the mailing list, and the advice ranges from 2 mappers/reducers per box to 99 per box. Any tips would be lovely.

Cheers,
Ye

On Mon, Mar 11, 2013 at 4:56 PM, Julien Nioche <[email protected]> wrote:

> > My guess is that 48hr to parse 100k urls does not sound efficient.
>
> That's definitely not right. You mentioned that you are using a medium
> instance and that the memory is at ~100% usage, so my guess would be that
> you are swapping a lot. Check the swap usage. Maybe move to a large
> instance instead and allow more memory per mapper / reducer, as the
> parsing step can be greedy.
>
> > Unfortunately 100k is just the beginning for me. :( I am looking at 10
> > million per fetch cycle. I am looking for ideas and pointers on how to
> > gain speed. Maybe using/tweaking Map Reduce would be the answer?
>
> I think your problem is a lot more basic than that.
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
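For context on the slot question above: on Hadoop 1.x (which Nutch typically runs on), per-node task slots are set in mapred-site.xml on each TaskTracker, and the usual rule of thumb is roughly one task per core, keeping (map slots + reduce slots) x child JVM heap well under the node's physical RAM so it never swaps. A minimal sketch for an 8 GiB slave, assuming 4 map slots, 2 reduce slots, and 1 GiB heap per task (illustrative numbers, not a tested recommendation):

```xml
<!-- mapred-site.xml: illustrative per-node slot counts for an 8 GiB slave.
     Property names are the Hadoop 1.x ones; adjust values to your cores/RAM. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- map slots per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- reduce slots per TaskTracker -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1g</value> <!-- heap per task JVM; 6 slots x 1 GiB leaves headroom below 8 GiB -->
</property>
```

With settings like these, a memory-hungry step such as Nutch's parse gets a full 1 GiB per task instead of contending for one JVM in local mode.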

