Hi Roland,

You say you start a fetch run; does this mean the FetcherJob or the GeneratorJob? What kind of settings do you run your Nutch server with?

On Wednesday, February 20, 2013, Roland <[email protected]> wrote:
> Hi list,
>
> we're experimenting with nutch 2.1 and cassandra 1.2.1 (on ? hosts).
> Our cassandra 'webpage' store has about 31GB right now on disk, we add URLs by 'injecting' them, about 100k-300k per cycle.
> When starting a 'fetch' run, it now needs about an hour before the queues are set up / the first page is fetched.
> During this time we can see about 180MBit/s network traffic from the cassandra host to the nutch host (outgoing of cassandra).
> If I calculate the transferred data during this time (taking only 150Mbit/s into account):
> 150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
>
> So, why does nutch load all data from the db, and not only the relevant data of this fetch? And why does it happen twice?
>
> Thanks,
> Roland
>

--
*Lewis*
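
For reference, the bandwidth estimate in the quoted message can be re-checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `transferred_gib` is made up for illustration, and the inputs (150 Mbit/s, one hour, a 31GB store) are taken straight from Roland's message. It assumes a constant transfer rate and treats the on-disk store size as directly comparable to the wire volume, which is an approximation.

```python
def transferred_gib(rate_mbit_per_s: float, seconds: float) -> float:
    """Data volume in GiB moved at a constant rate given in decimal megabits/s."""
    # Mbit/s -> bytes/s (1 Mbit = 1000 * 1000 bits, 8 bits per byte)
    bytes_per_second = rate_mbit_per_s * 1000 * 1000 / 8
    # bytes -> GiB (1024^3 bytes)
    return bytes_per_second * seconds / 1024**3


# Figures from the quoted message: 150 Mbit/s sustained for about one hour.
estimate = transferred_gib(150, 3600)

# On-disk size of the 'webpage' store as reported in the message.
store_size_gb = 31

print(f"{estimate:.1f} GiB transferred")
print(f"{estimate / store_size_gb:.1f}x the reported store size")
```

The result lands just under 63 GiB, about twice the reported store size, which is where the "why does it happen twice?" question comes from.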

