Hi Lewis,

OK, first a few words about the hardware: Nutch is running on a 16-core AMD Opteron at 2 GHz, Cassandra on an 8-core Intel Xeon at 3.3 GHz; both have 128 GB RAM and are connected via a GBit network.

Here is the timing of a generate job (after injecting 228007 URLs):
time ./bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1361519572-1552351269

real 16m26.089s
user 3m3.303s
sys 0m43.123s

The fetcher job for this ID has now been running for 70 min and has used 63 min of CPU time. The load on both servers is <1.0, but network traffic is around 180-200 MBit/s, as described before. Both servers are responsive and handle a few other jobs without problems. The terminal shows this:
VM Started: FetcherJob: starting
FetcherJob: batchId: 1361519572-1552351269
FetcherJob: threads: 30
FetcherJob: parsing: true
FetcherJob: resuming: true
FetcherJob : timelimit set for : -1

--Roland

On 22.02.2013 09:39, Lewis John Mcgibbney wrote:
Roland, I am curious to know exactly what is happening between the FetcherJob
initiation and the actual fetch of the 1st URL. Does the terminal just hang?
Can you track some metrics of the job?

On Thursday, February 21, 2013, Roland <[email protected]> wrote:
Hi Julien,

The point I personally don't get is: why is generating fast but fetching not?
If it's possible to filter at the backend in the GeneratorJob (which I think
is what it does), shouldn't it be possible to do the same for the fetcher?
--Roland

On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and then passed on to MapReduce via Gora, or, as I
assume from looking at the code, done within MapReduce, which means that all
the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think, e.g.,
about a large webtable which would have to be passed entirely to MapReduce
even if only a handful of entries are to be processed.

Makes sense?

Julien
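
To make that distinction concrete, here is a rough sketch of what backend-side filtering could look like through Gora. This is not current Nutch code: it assumes a Gora version with filter push-down (Query.setFilter / SingleFieldValueFilter) and a batchId field on the WebPage schema, neither of which may exist in the versions discussed here.

import org.apache.gora.filter.FilterOp;
import org.apache.gora.filter.SingleFieldValueFilter;
import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.WebPage;

public class BatchQuerySketch {

  // Ask the backend for rows of one batch only, so that non-matching rows
  // never leave Cassandra/HBase. Field name and filter support are assumptions.
  static Result<String, WebPage> rowsForBatch(DataStore<String, WebPage> store,
      String batchId) throws Exception {
    Query<String, WebPage> query = store.newQuery();

    SingleFieldValueFilter<String, WebPage> filter =
        new SingleFieldValueFilter<String, WebPage>();
    filter.setFieldName("batchId");       // assumed field name
    filter.setFilterOp(FilterOp.EQUALS);
    filter.getOperands().add(batchId);
    filter.setFilterIfMissing(true);      // ignore rows without a batch id

    query.setFilter(filter);              // evaluated by the store, if supported
    return store.execute(query);
  }
}

If instead every row is handed to the mapper and the batch check happens there (as the snippets further down the thread suggest), the whole webtable crosses the network on every generate/fetch cycle, which would be consistent with the sustained 180-200 MBit/s Roland describes.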


On 21 February 2013 01:52, Lewis John Mcgibbney <[email protected]> wrote:

Those filters are skipped for URLs which already have a non-null GENERATE_MARK,
e.g.

      if (Mark.GENERATE_MARK.checkMark(page) != null) {
        if (GeneratorJob.LOG.isDebugEnabled()) {
          GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
        }
        return;
      }

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.
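
For URLs that pass that check, the mapper then applies the distance limit, the URLNormalizers and the URLFilters. Roughly (this is a paraphrase, not a verbatim copy of GeneratorMapper, so identifiers may differ between 2.x versions):

      // 1. distance: pages further from a seed than db.max.distance allows are
      //    skipped (the stored distance marker is read from page.getMarkers())
      // 2. URLNormalizer(s), then 3. URLFilter(s):
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
        if (filter && filters.filter(url) == null) {
          return;   // rejected by a URLFilter, so never generated
        }
      } catch (Exception e) {
        GeneratorJob.LOG.warn("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
        return;
      }

Note that all of this runs inside the map task, i.e. after the row has already been read from the backend, which is the point Julien makes above.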

On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:

Hi,

Are those filters applied to all data selected from HBase, or are they sent to
HBase as filters to select a subset of all HBase records?

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

The generator also does not have filters. Its mapper goes over all records, as
far as I know. If you use Hadoop you can see how many records go as input to
the mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before selecting
them for inclusion, based on the following, in that order:
- distance
- URLNormalizer(s)
- URLFilter(s)
I am going to start a new thread on improvements to the GeneratorJob regarding
better configuration, as it is a crucial stage in the crawl process.
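
One way to settle this empirically, along the lines of Alex's suggestion above, is to look at the job's "Map input records" counter, either in the JobTracker web UI or programmatically. A minimal sketch, assuming the Hadoop 2 mapreduce API (in Hadoop 1.x the counter enum lives under a different class):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class MapInputCounterCheck {

  // Prints how many records the mappers of a finished job actually consumed.
  static void printMapInputRecords(Job job) throws IOException {
    long mapInputRecords =
        job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    // If this is close to the whole webtable (228007 rows here) rather than
    // the size of the generated batch, rows are being filtered after the scan.
    System.out.println("Map input records: " + mapInputRecords);
  }
}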

So the issue here, as you correctly explain, is with the Fetcher obtaining the
URLs which have been marked with a desired batchId. This would be done via a
scanner in Gora.
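
For reference, the batch-id check in the 2.x FetcherMapper looks roughly like this (a simplified paraphrase, not verbatim; the real code also handles the case of fetching all batches):

      // Paraphrase of FetcherMapper.map(): by the time this runs, the row has
      // already been scanned out of the backend and shipped to the map task.
      CharSequence mark = Mark.GENERATE_MARK.checkMark(page);
      if (mark == null || !mark.toString().equals(batchId)) {
        return;   // wrong batch: skipped here, but the row still crossed the network
      }
      // ... otherwise the URL is queued for fetching ...

If that is accurate, the scanner walks the whole webtable and the batch selection only happens client-side, in the mapper.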



--
Lewis



