Those filters are applied only to URLs which do not already have a
GENERATE_MARK set, e.g.
  // in GeneratorMapper.map(): pages already carrying the mark are skipped
  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;
  }
Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.
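
For a fuller picture, the map() body looks roughly like the sketch below.
This is a paraphrase from memory of the 2.x GeneratorMapper rather than a
copy from trunk, so the exact names (e.g. the normalizer scope) and the
error handling may differ slightly:

  public void map(String reversedUrl, WebPage page, Context context)
      throws IOException, InterruptedException {
    String url = TableUtil.unreverseUrl(reversedUrl);

    // the GENERATE_MARK check quoted above sits here and returns early
    // for anything already generated

    // 1. distance: drop pages further than generate.max.distance
    //    from a seed (check omitted in this sketch)

    // 2. URLNormalizer(s), then 3. URLFilter(s); a null result drops the URL
    try {
      url = normalizers.normalize(url, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      if (url == null || filters.filter(url) == null) {
        return;
      }
    } catch (Exception e) {
      return;
    }

    // surviving URLs are scored and emitted so that the reducer can
    // select them into the new batch and set the GENERATE_MARK
  }

Note that all of this runs on the mapper side of the scan; the URLFilters
are ordinary plugins and are not pushed down to HBase/Cassandra, so they
only decide which of the scanned rows end up in the new batch.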
On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
> Hi,
>
> Are those filters applied to all data selected from HBase, or are they sent
> to HBase as filters to select a subset of all HBase records?
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Lewis John Mcgibbney <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Feb 20, 2013 12:56 pm
> Subject: Re: nutch with cassandra internal network usage
>
>
> Hi Alex,
>
> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>
> >
> > The generator also does not have filters. Its mapper goes over all
> > records as far as I know. If you use hadoop you can see how many
> > records go as input to mappers. Also see this
> >
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion, based on the following, in that order:
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> I am going to start a new thread on improvements to the GeneratorJob,
> in particular better configuration, as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with the desired batchId. This would be done
> via a scanner in Gora.
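>
> To illustrate what that scan looks like at the Gora level, here is a rough
> sketch (not the actual FetcherJob code; the store creation call and the
> projected field names are just illustrative):
>
>   DataStore<String, WebPage> store =
>       StorageUtils.createWebStore(conf, String.class, WebPage.class);
>   Query<String, WebPage> query = store.newQuery();
>   // only a column projection -- no row-level predicate is pushed down
>   query.setFields("markers", "status", "fetchTime");
>
>   Result<String, WebPage> result = store.execute(query);
>   while (result.next()) {
>     WebPage page = result.get();
>     Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
>     // rows from other batches have already crossed the network by this
>     // point; they are simply skipped on the client side
>     if (mark == null || !mark.toString().equals(batchId)) {
>       continue;
>     }
>     // ... hand the URL on to a fetch thread ...
>   }
>   result.close();
>
> As far as I know a Gora query can only restrict the key range and the
> fields to return, not filter on field values, which is why the batchId
> check has to happen in the mapper.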
>
--
*Lewis*