Hi list,

we're experimenting with Nutch 2.1 and Cassandra 1.2.1 (on different hosts).
Our Cassandra 'webpage' store currently takes about 31GB on disk. We add URLs by injecting them, about 100k-300k per cycle. When we start a 'fetch' run, it now takes about an hour before the queues are set up and the first page is fetched. During this time we see about 180MBit/s of network traffic from the Cassandra host to the Nutch host (outgoing on the Cassandra side). If I calculate the data transferred during this time (conservatively assuming only 150MBit/s):
150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
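
For anyone who wants to check the estimate, here is the same calculation as a small Python sketch. The function and variable names are just mine for illustration, and it assumes the 150MBit/s rate holds steadily for the full hour, which is only an approximation of what we observed:

```python
# Rough estimate of the data sent from Cassandra to Nutch before the first fetch.
# Assumption: a constant 150 Mbit/s for one hour (we observed about 180 Mbit/s).
RATE_MBIT_PER_S = 150
DURATION_S = 3600
STORE_SIZE_GB = 31  # on-disk size of the 'webpage' store


def transferred_gib(rate_mbit_per_s, duration_s):
    """Convert a network rate in Mbit/s over a duration into GiB transferred."""
    bytes_per_s = rate_mbit_per_s * 1000 * 1000 / 8
    return bytes_per_s * duration_s / 1024**3


total = transferred_gib(RATE_MBIT_PER_S, DURATION_S)
ratio = total / STORE_SIZE_GB
print(f"{total:.1f} GiB transferred, about {ratio:.1f}x the store size")
```

Note that the on-disk size may differ from the size of the data on the wire (compression, serialization overhead), so the ratio is only a rough indicator.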

So, why does Nutch load all data from the DB, and not only the data relevant to this fetch? And since 62GB is roughly twice the 31GB store, why does it apparently happen twice?

Thanks,
Roland
