Hi Lewis,

OK, first a few words about the hardware: Nutch is running on a 16-core AMD Opteron at 2 GHz, Cassandra on an 8-core Intel Xeon at 3.3 GHz; both have 128 GB RAM and are connected via a GBit network.

Here is the timing of a generate job (after injecting 228007 URLs):
time ./bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1361519572-1552351269

real 16m26.089s
user 3m3.303s
sys 0m43.123s

The fetcher job for this ID has now been running for 70 min and has used 63 min of CPU time. The load on both servers is <1.0, but network traffic is around 180-200 MBit/s, as described before. Both servers are responsive and handle a few other jobs without problems. The terminal shows this:
VM Started: FetcherJob: starting
FetcherJob: batchId: 1361519572-1552351269
FetcherJob: threads: 30
FetcherJob: parsing: true
FetcherJob: resuming: true
FetcherJob : timelimit set for : -1

--Roland

On 22.02.2013 09:39, Lewis John Mcgibbney wrote:
Roland, I am curious to know exactly what is happening between the FetcherJob
initiation and the actual fetch of the 1st URL. Does the terminal just hang?
Can you track some metrics of the job?

On Thursday, February 21, 2013, Roland <[email protected]> wrote:
Hi Julien,

The point I personally don't get is: why is generating fast but fetching not?
If it's possible to filter at the backend in the GeneratorJob (which I think
is what it does), shouldn't it be possible to do the same for the fetcher?
--Roland

On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and then passed on to MapReduce via Gora, or, as I
assume from looking at the code, done within MapReduce, which means that all
the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think, e.g.,
about a large webtable which would have to be passed entirely to MapReduce
even if only a handful of entries are to be processed.

Makes sense?

Julien
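
To make that distinction concrete, here is a rough sketch of what backend-side filtering could look like through Gora. This is not current Nutch code: it assumes a Gora version with filter push-down (Query.setFilter / SingleFieldValueFilter) and a batchId field on the WebPage schema, neither of which may exist in the versions discussed here.

import org.apache.gora.filter.FilterOp;
import org.apache.gora.filter.SingleFieldValueFilter;
import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.WebPage;

public class BatchQuerySketch {

  // Ask the backend for rows of one batch only, so that non-matching rows
  // never leave Cassandra/HBase. Field name and filter support are assumptions.
  static Result<String, WebPage> rowsForBatch(DataStore<String, WebPage> store,
      String batchId) throws Exception {
    Query<String, WebPage> query = store.newQuery();

    SingleFieldValueFilter<String, WebPage> filter =
        new SingleFieldValueFilter<String, WebPage>();
    filter.setFieldName("batchId");       // assumed field name
    filter.setFilterOp(FilterOp.EQUALS);
    filter.getOperands().add(batchId);
    filter.setFilterIfMissing(true);      // ignore rows without a batch id

    query.setFilter(filter);              // evaluated by the store, if supported
    return store.execute(query);
  }
}

If instead every row is handed to the mapper and the batch check happens there (as the snippets further down the thread suggest), the whole webtable crosses the network on every generate/fetch cycle, which would be consistent with the sustained 180-200 MBit/s Roland describes.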


On 21 February 2013 01:52, Lewis John Mcgibbney <[email protected]> wrote:

Those filters are skipped for URLs which already have a non-null GENERATE_MARK,
e.g.

      if (Mark.GENERATE_MARK.checkMark(page) != null) {
        if (GeneratorJob.LOG.isDebugEnabled()) {
          GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
        }
        return;
      }

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.
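
For URLs that pass that check, the mapper then applies the distance limit, the URLNormalizers and the URLFilters. Roughly (this is a paraphrase, not a verbatim copy of GeneratorMapper, so identifiers may differ between 2.x versions):

      // 1. distance: pages further from a seed than db.max.distance allows are
      //    skipped (the stored distance marker is read from page.getMarkers())
      // 2. URLNormalizer(s), then 3. URLFilter(s):
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
        if (filter && filters.filter(url) == null) {
          return;   // rejected by a URLFilter, so never generated
        }
      } catch (Exception e) {
        GeneratorJob.LOG.warn("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
        return;
      }

Note that all of this runs inside the map task, i.e. after the row has already been read from the backend, which is the point Julien makes above.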

On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:

Hi,

Are those filters applied to all data selected from HBase, or are they sent to
HBase as filters to select a subset of all HBase records?

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

The generator also does not have filters. Its mapper goes over all records, as
far as I know. If you use Hadoop you can see how many records go as input to
the mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before selecting
them for inclusion, based on the following, in that order:
- distance
- URLNormalizer(s)
- URLFilter(s)
I am going to start a new thread on improvements to the GeneratorJob regarding
better configuration, as it is a crucial stage in the crawl process.
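
One way to settle this empirically, along the lines of Alex's suggestion above, is to look at the job's "Map input records" counter, either in the JobTracker web UI or programmatically. A minimal sketch, assuming the Hadoop 2 mapreduce API (in Hadoop 1.x the counter enum lives under a different class):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class MapInputCounterCheck {

  // Prints how many records the mappers of a finished job actually consumed.
  static void printMapInputRecords(Job job) throws IOException {
    long mapInputRecords =
        job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    // If this is close to the whole webtable (228007 rows here) rather than
    // the size of the generated batch, rows are being filtered after the scan.
    System.out.println("Map input records: " + mapInputRecords);
  }
}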

So the issue here, as you correctly explain, is with the Fetcher obtaining the
URLs which have been marked with a desired batchId. This would be done via a
scanner in Gora.
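
For reference, the batch-id check in the 2.x FetcherMapper looks roughly like this (a simplified paraphrase, not verbatim; the real code also handles the case of fetching all batches):

      // Paraphrase of FetcherMapper.map(): by the time this runs, the row has
      // already been scanned out of the backend and shipped to the map task.
      CharSequence mark = Mark.GENERATE_MARK.checkMark(page);
      if (mark == null || !mark.toString().equals(batchId)) {
        return;   // wrong batch: skipped here, but the row still crossed the network
      }
      // ... otherwise the URL is queued for fetching ...

If that is accurate, the scanner walks the whole webtable and the batch selection only happens client-side, in the mapper.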



--
Lewis



