Re: nutch with cassandra internal network usage

Roland Sun, 03 Mar 2013 23:26:33 -0800

Hi all,

I've read the sources ;)
(no, not really all, but enough, I hope)

So, the major difference between generator & fetcher is the fields that
each one loads from the db. As I had fetcher.store.content=true in the
beginning, there was a lot of data in the content fields. I run with
fetcher.parse=true, and that's why it loads all content during start-up
of the FetcherJob.


I did this in my local 2.1 sources:
Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===================================================================
--- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
@@ -140,6 +140,8 @@
     if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
       ParserJob parserJob = new ParserJob();
       fields.addAll(parserJob.getFields(job));
+      fields.remove(WebPage.Field.CONTENT); // FIXME
+      fields.remove(WebPage.Field.OUTLINKS); // FIXME
     }
     ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
     fields.addAll(protocolFactory.getFields());

and now the start-up time of a FetcherJob is about 10 minutes :)
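
For readers who want to see the idea behind the patch in isolation, here is a minimal sketch in plain Java. It does not use any Nutch classes: the Field enum and the helpers fetcherFields(), parserFields() and fieldsToLoad() are hypothetical stand-ins, and the field lists are illustrative assumptions, not what Nutch 2.1 actually returns. It only shows that the set of requested fields decides which columns are read for every row, so removing CONTENT and OUTLINKS from that set keeps them from being read at start-up.

```java
import java.util.EnumSet;
import java.util.Set;

public class FetcherFieldsSketch {

    // Hypothetical stand-in for WebPage.Field; only a few names are listed.
    enum Field { BASE_URL, FETCH_TIME, MARKERS, CONTENT, OUTLINKS }

    // Assumption: fields the fetcher needs for its own work.
    static Set<Field> fetcherFields() {
        return EnumSet.of(Field.BASE_URL, Field.FETCH_TIME, Field.MARKERS);
    }

    // Assumption: fields a parser would ask for, including the large ones.
    static Set<Field> parserFields() {
        return EnumSet.of(Field.BASE_URL, Field.CONTENT, Field.OUTLINKS);
    }

    // Same shape as the patch: add the parser's fields, then drop the
    // two large ones so they are not requested from the store.
    static Set<Field> fieldsToLoad(boolean parse) {
        Set<Field> fields = EnumSet.copyOf(fetcherFields());
        if (parse) {
            fields.addAll(parserFields());
            fields.remove(Field.CONTENT);
            fields.remove(Field.OUTLINKS);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println("fields requested with parsing on: " + fieldsToLoad(true));
    }
}
```

Whether it is safe to drop these two fields in a real crawl depends on what the parser expects to find already stored, which is presumably why the patch lines are marked FIXME.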

--Roland


On 22.02.2013 10:28, Roland wrote:

Hi Julien,

ok, so thanks for the clarification, I think I have to read the sources :)


--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

Hi Roland

My previous email should have started with "The point Alex is making
is ..." and not just "The point is ...".

I don't have an explanation as to why the generator is faster than the
fetching as I don't use 2.x at all, but it would definitely be
interesting to find out. The behaviour of the fetcher is how I expect
GORA to behave in its current form, i.e. pull everything - filter -
process.

Julien


On 21 February 2013 16:58, Roland <[email protected]> wrote:

Hi Julien,

the point I personally don't get is: why is generating fast but
fetching not?

If it's possible to filter the generatorJob at the backend (which I
think it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

Lewis,

The point is whether the filtering is done on the backend side (e.g.
using queries, indices, etc.) and then passed on to MapReduce via GORA
or, as I assume from looking at the code, filtered within the MapReduce
job, which means that all the entries are pulled from the backend
anyway.

This makes quite a difference in terms of performance if you think,
e.g., about a large webtable which would have to be passed entirely to
MapReduce even if only a handful of entries are to be processed.

Makes sense?
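
To make the difference concrete, here is a small sketch in plain Java. It is not GORA or Nutch code: Row, Store, filterInJob() and filterInBackend() are hypothetical names, and the in-memory store only counts how many rows leave the "backend" in each case. Both variants return the same rows; only the amount of data transferred differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterPlacementSketch {

    // Hypothetical row of a web table: a URL plus the batch it belongs to.
    static final class Row {
        final String url;
        final String batchId;
        Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
    }

    // Hypothetical in-memory "backend" that counts the rows it hands out.
    static final class Store {
        private final List<Row> rows = new ArrayList<>();
        int rowsTransferred = 0;

        void put(Row row) { rows.add(row); }

        // Filter evaluated inside the store: only matches leave the backend.
        List<Row> scan(Predicate<Row> backendFilter) {
            List<Row> out = new ArrayList<>();
            for (Row r : rows) {
                if (backendFilter.test(r)) {
                    out.add(r);
                    rowsTransferred++;
                }
            }
            return out;
        }

        // No filter: every row leaves the backend.
        List<Row> scanAll() {
            rowsTransferred += rows.size();
            return new ArrayList<>(rows);
        }
    }

    // "pull everything - filter - process": the job does the filtering.
    static List<Row> filterInJob(Store store, String batchId) {
        List<Row> out = new ArrayList<>();
        for (Row r : store.scanAll()) {
            if (batchId.equals(r.batchId)) out.add(r);
        }
        return out;
    }

    // The backend does the filtering; the job only sees matching rows.
    static List<Row> filterInBackend(Store store, String batchId) {
        return store.scan(r -> batchId.equals(r.batchId));
    }

    // Builds a store with `total` rows, the first `inBatch` of them in "batch-1".
    static Store sampleStore(int total, int inBatch) {
        Store store = new Store();
        for (int i = 0; i < total; i++) {
            store.put(new Row("http://example.com/" + i, i < inBatch ? "batch-1" : "other"));
        }
        return store;
    }

    public static void main(String[] args) {
        Store a = sampleStore(1000, 5);
        filterInJob(a, "batch-1");
        Store b = sampleStore(1000, 5);
        filterInBackend(b, "batch-1");
        System.out.println("filter in job:     " + a.rowsTransferred + " rows transferred");
        System.out.println("filter in backend: " + b.rowsTransferred + " rows transferred");
    }
}
```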

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney
<[email protected]> wrote:

Those filters are applied only to URLs which have a null GENERATE_MARK.
URLs with a non-null mark are skipped first, e.g.

      // A non-null GENERATE_MARK means the page was already generated.
      if (Mark.GENERATE_MARK.checkMark(page) != null) {
        if (GeneratorJob.LOG.isDebugEnabled()) {
          GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
        }
        return;
      }

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:

  Hi,

Are those filters put on all data selected from hbase or sent to hbase
as filters to select a subset of all hbase records?

Thanks.
Alex.







-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

  The generator also does not have filters. Its mapper goes over all
  records as far as I know. If you use hadoop you can see how many
  records go as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
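
As a rough illustration of that ordering, here is a sketch in plain Java. It is not the GeneratorMapper source: select(), the maxDistance parameter and the convention that a negative maxDistance means "no limit" are assumptions made for this example, and the normalizer and filter are simple stand-ins for the URLNormalizer and URLFilter plugins.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class GeneratorSelectionSketch {

    // Returns the normalized URL if it passes every check, or null if rejected.
    static String select(String url, int distance, int maxDistance,
                         UnaryOperator<String> normalizer,
                         Predicate<String> filter) {
        // 1. distance: skip URLs that are too many hops from the seeds
        //    (assumption: a negative maxDistance disables the check).
        if (maxDistance >= 0 && distance > maxDistance) {
            return null;
        }
        // 2. normalizer: bring the URL into a canonical form.
        String normalized = normalizer.apply(url);
        if (normalized == null) {
            return null;
        }
        // 3. filter: runs last, so it sees the normalized URL.
        return filter.test(normalized) ? normalized : null;
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
        Predicate<String> httpOnly = u -> u.startsWith("http://");

        System.out.println(select("HTTP://Example.COM/a", 1, 3, lowerCase, httpOnly));
        System.out.println(select("ftp://example.com/a", 1, 3, lowerCase, httpOnly));
        System.out.println(select("http://example.com/far", 5, 3, lowerCase, httpOnly));
    }
}
```

The order matters in this sketch because the filter only accepts the first URL after normalization has lower-cased its scheme.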
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher
obtaining the URLs which have been marked with a desired batchId. This
would be done via scanner in Gora.

--
*Lewis*
