Hi Julien,

The point I personally don't get is: why is generating fast, but fetching not?
If it's possible to filter the GeneratorJob at the backend (which is what I think it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and then passed on to MapReduce via GORA, or, as I
assume from looking at the code, done within the MapReduce job, which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g.
about a large webtable which would have to be passed to MapReduce in its
entirety even if only a handful of entries are to be processed.
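
To make that concrete, here is a minimal sketch (not the actual FetcherJob
mapper; the class name and the configuration key are made up for the example)
of the map-side pattern I am assuming: every WebPage row is streamed out of
the backend, and anything not marked for the wanted batch is simply dropped
inside map(), after it has already crossed the network.

import java.io.IOException;

import org.apache.avro.util.Utf8;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

// Sketch of the suspected pattern: the backend hands us *every* WebPage row,
// and rows that do not carry the wanted generate mark are discarded here.
public class BatchIdDropSketch extends GoraMapper<String, WebPage, String, WebPage> {

  private Utf8 batchId;

  @Override
  protected void setup(Context context) {
    // "sketch.batch.id" is a made-up key; a real job would pass the batchId
    // through its own configuration property.
    batchId = new Utf8(context.getConfiguration().get("sketch.batch.id"));
  }

  @Override
  protected void map(String url, WebPage page, Context context)
      throws IOException, InterruptedException {
    Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
    if (mark == null || !mark.equals(batchId)) {
      return; // the row was pulled from the backend anyway, then thrown away
    }
    context.write(url, page);
  }
}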

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney
<[email protected]>wrote:

Those filters are skipped for URLs which already have a non-null
GENERATE_MARK,
e.g.

     // in GeneratorMapper.map(): skip pages that already carry a generate mark
     if (Mark.GENERATE_MARK.checkMark(page) != null) {
       if (GeneratorJob.LOG.isDebugEnabled()) {
         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
       }
       return;
     }
Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:

Hi,

Are those filters applied to all the data after it has been selected from
HBase, or are they sent to HBase as filters to select a subset of all HBase
records?

Thanks.
Alex.







-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

The generator also does not have filters. Its mapper goes over all records,
as far as I know. If you use Hadoop you can see how many records go as input
to the mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following, in that order (a
condensed sketch follows below):
- distance
- URLNormalizer(s)
- URLFilter(s)
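
In condensed form (not the exact source; the normalizer scope constant is a
placeholder and the distance limit is taken as a plain parameter here), the
selection logic looks roughly like this:

import java.net.MalformedURLException;

import org.apache.nutch.net.URLFilterException;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;

// Condensed sketch of the checks listed above: distance first, then the
// normalizers, then the filters. Only URLs that survive all three steps
// become generate candidates.
public class GenerateSelectionSketch {

  private final int maxDistance;        // -1 means "no limit" in this sketch
  private final URLNormalizers normalizers;
  private final URLFilters filters;

  public GenerateSelectionSketch(int maxDistance, URLNormalizers normalizers,
      URLFilters filters) {
    this.maxDistance = maxDistance;
    this.normalizers = normalizers;
    this.filters = filters;
  }

  // Returns the normalized URL if it should be generated, or null otherwise.
  public String select(String url, int distance) {
    if (maxDistance >= 0 && distance > maxDistance) {
      return null;                                                // 1. distance
    }
    try {
      url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); // 2. normalize
      return filters.filter(url);                                 // 3. filter (null = rejected)
    } catch (MalformedURLException e) {
      return null;
    } catch (URLFilterException e) {
      return null;
    }
  }
}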
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with the desired batchId. This would be done
via a scanner in Gora.
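
If the filter support in Gora (the org.apache.gora.filter package) can be
used with the backend in question, the batchId check could in principle be
attached to the query itself rather than done in the mapper, so that only the
marked rows leave the store. A rough sketch follows; the class and method
names are taken from the Gora filter API as I understand it, so treat the
exact signatures as assumptions, and whether the Cassandra module evaluates
such a filter server-side or only on the client is a separate question:

import org.apache.avro.util.Utf8;
import org.apache.gora.filter.FilterOp;
import org.apache.gora.filter.MapFieldValueFilter;
import org.apache.gora.query.Query;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

// Sketch only: push the "generated in batch X" predicate into the Gora query
// so that, on backends which understand it, rows from other batches never
// leave the store.
public class BatchIdQuerySketch {

  public static Query<String, WebPage> batchQuery(DataStore<String, WebPage> store,
      String batchId) {
    MapFieldValueFilter<String, WebPage> filter =
        new MapFieldValueFilter<String, WebPage>();
    filter.setFieldName(WebPage.Field.MARKERS.toString()); // the markers map field
    filter.setMapKey(Mark.GENERATE_MARK.getName());        // key inside that map
    filter.setFilterOp(FilterOp.EQUALS);
    filter.setFilterIfMissing(true);   // drop rows with no generate mark at all
    filter.getOperands().add(new Utf8(batchId));

    Query<String, WebPage> query = store.newQuery();
    query.setFilter(filter);
    return query;
  }
}

I believe the HBase module can translate such a filter into a server-side
scan filter, while a backend that cannot would presumably fall back to
filtering on the client, which is no better than doing it in map().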




--
*Lewis*



