I get it fine. I do think it's important to discuss the current filtering
code in the generator, though. Yeah, okay, it turns out that our current
implementation (which reads all entries and then filters on the Nutch side)
can be horribly expensive, but at least there is some mechanism in place,
right? We will work on the scan over in Gora after the 0.3 release.

Null unions in Avro schemas (e.g. GORA-174) have been kicking our heads in,
but we are getting there. As always, anyone interested in contributing to
the cause, please shoot over to user@gora.
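
For anyone not living in the Avro weeds, a "null union" is just a field typed
as a union of null and a concrete type. A minimal, illustrative snippet
(plain Avro API, nothing Gora-specific; the class name is mine):

    import java.util.Arrays;
    import org.apache.avro.Schema;

    public class NullUnionExample {
      public static void main(String[] args) {
        // A nullable string field is modelled in Avro as the union ["null","string"].
        Schema nullableString = Schema.createUnion(Arrays.asList(
            Schema.create(Schema.Type.NULL),
            Schema.create(Schema.Type.STRING)));
        System.out.println(nullableString); // prints ["null","string"]
      }
    }

Handling the null branch consistently in the generated data beans and the
backend serializers is, roughly speaking, where the pain comes from.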
Thanks
Lewis

On Thursday, February 21, 2013, Julien Nioche <lists.digitalpeb...@gmail.com>
wrote:
> Lewis,
>
> The point is whether the filtering is done on the backend side (e.g. using
> queries, indices, etc.) and then passed on to MapReduce via GORA, or, as I
> assume from looking at the code, filtered within the MapReduce job, which
> means that all the entries are pulled from the backend anyway.
> This makes quite a difference in terms of performance if you think e.g.
> about a large webtable which would have to be entirely passed to MapReduce
> even if only a handful of entries are to be processed.
>
> Makes sense?
>
> Julien
>
>
> On 21 February 2013 01:52, Lewis John Mcgibbney
> <lewis.mcgibb...@gmail.com> wrote:
>
>> Those filters are skipped for any URL which has a non-null GENERATE_MARK,
>> e.g.
>>
>>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>       if (GeneratorJob.LOG.isDebugEnabled()) {
>>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>       }
>>       return;
>>     }
>>
>> Therefore filters will be applied to all URLs which have a null
>> GENERATE_MARK value.
>>
>> On Wed, Feb 20, 2013 at 2:45 PM, <alx...@aim.com> wrote:
>>
>> > Hi,
>> >
>> > Are those filters put on all data selected from hbase or sent to hbase
>> > as filters to select a subset of all hbase records?
>> >
>> > Thanks.
>> > Alex.
>> >
>> >
>> > -----Original Message-----
>> > From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
>> > To: user <user@nutch.apache.org>
>> > Sent: Wed, Feb 20, 2013 12:56 pm
>> > Subject: Re: nutch with cassandra internal network usage
>> >
>> >
>> > Hi Alex,
>> >
>> > On Wed, Feb 20, 2013 at 11:54 AM, <alx...@aim.com> wrote:
>> >
>> > >
>> > > The generator also does not have filters. Its mapper goes over all
>> > > records as far as I know. If you use hadoop you can see how many
>> > > records go as input to mappers. Also see this
>> > >
>> >
>> > I don't think this is true. The GeneratorMapper filters URLs before
>> > selecting them for inclusion based on the following
>> > - distance
>> > - URLNormalizer(s)
>> > - URLFilter(s)
>> > in that order.
>> > I am going to start a new thread on improvements to the GeneratorJob
>> > regarding better configuration as it is a crucial stage in the crawl
>> > process.
>> >
>> > So the issue here, as you correctly explain, is with the Fetcher obtaining
>> > the URLs which have been marked with a desired batchId. This would be done
>> > via scanner in Gora.
>> >
>> >
>> >
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> *Open Source Solutions for Text Engineering*
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

-- 
*Lewis*
