Hi list,
we're experimenting with Nutch 2.1 and Cassandra 1.2.1 (on different hosts).
Our Cassandra 'webpage' store currently takes up about 31GB on disk. We add
URLs by 'injecting' them, about 100k-300k per cycle.
When we start a 'fetch' run, it now takes about an hour before the
queues are set up / the first page is fetched.
During this time we see about 180MBit/s of network traffic from the
Cassandra host to the Nutch host (outbound from Cassandra).
If I calculate the data transferred during this time (conservatively
assuming only 150MBit/s):
150 MBit/s * 1000 * 1000 / 8 / 1024 / 1024 / 1024 * 3600 sec ~= 62.9GB
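
For anyone who wants to double-check the arithmetic, here is a quick
sketch in Python. The function name is just something I made up for
illustration, and it assumes a constant rate for the full hour, decimal
megabits on the wire and binary gigabytes (1024^3 bytes) for the result.

```python
# Back-of-the-envelope check of the figures in this mail.
# transferred_gib is a hypothetical helper, not part of Nutch or Cassandra.

def transferred_gib(rate_mbit_per_s: float, seconds: float) -> float:
    """Data moved at a constant rate, in GiB (1024**3 bytes)."""
    bytes_per_s = rate_mbit_per_s * 1000 * 1000 / 8  # decimal megabits -> bytes
    return bytes_per_s * seconds / 1024**3

total = transferred_gib(150, 3600)  # conservative 150 MBit/s for one hour
store_gib = 31                      # on-disk size of the 'webpage' store

print(f"transferred: {total:.1f} GiB")
print(f"ratio to store size: {total / store_gib:.2f}")
```

The ratio comes out at roughly 2. On-disk size and the size of the same
data on the wire do not have to match, so I treat that factor as
indicative only.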
That is roughly twice the size of our 31GB store. So why does Nutch
seem to load all data from the db, and not only the data relevant to
this fetch? And why does it appear to happen twice?
Thanks,
Roland