Re: nutch with cassandra internal network usage

Julien Nioche Fri, 22 Feb 2013 01:11:26 -0800

Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the
fetcher, as I don't use 2.x at all, but it would definitely be interesting
to find out. The behaviour of the fetcher is how I expect GORA to behave in
its current form, i.e. pull everything, then filter, then process.
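
To make the difference between the two strategies concrete, here is a minimal
sketch in plain Java. It is not Nutch or GORA code: Row, Backend, scanAll,
scanWhereBatchId and rowsShipped are all hypothetical names made up for the
illustration, and the "backend" is just an in-memory list that counts how
many rows it hands over to the client.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanSketch {

    /** Hypothetical stand-in for a webtable row; not a Nutch or GORA class. */
    static final class Row {
        final String url;
        final String batchId; // null means the row belongs to no batch

        Row(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical backend that counts how many rows it ships to the client. */
    static final class Backend {
        final List<Row> rows = new ArrayList<>();
        int rowsShipped = 0;

        // Strategy 1: every row crosses the network, filtering happens later.
        List<Row> scanAll() {
            rowsShipped += rows.size();
            return new ArrayList<>(rows);
        }

        // Strategy 2: the predicate is evaluated on the backend side,
        // so only matching rows cross the network.
        List<Row> scanWhereBatchId(String batchId) {
            List<Row> out = new ArrayList<>();
            for (Row r : rows) {
                if (batchId.equals(r.batchId)) {
                    out.add(r);
                }
            }
            rowsShipped += out.size();
            return out;
        }
    }

    /** Pull everything, then filter on the client (e.g. inside the mapper). */
    static List<Row> clientSideFilter(Backend backend, String batchId) {
        List<Row> out = new ArrayList<>();
        for (Row r : backend.scanAll()) {
            if (batchId.equals(r.batchId)) {
                out.add(r);
            }
        }
        return out;
    }

    /** Builds a table of `total` rows of which the first `inBatch` are in "batch-1". */
    static Backend sampleBackend(int total, int inBatch) {
        Backend backend = new Backend();
        for (int i = 0; i < total; i++) {
            String batchId = i < inBatch ? "batch-1" : null;
            backend.rows.add(new Row("http://example.com/" + i, batchId));
        }
        return backend;
    }

    public static void main(String[] args) {
        Backend a = sampleBackend(1000, 5);
        List<Row> viaClient = clientSideFilter(a, "batch-1");

        Backend b = sampleBackend(1000, 5);
        List<Row> viaBackend = b.scanWhereBatchId("batch-1");

        System.out.println("client-side filter: selected " + viaClient.size()
                + ", shipped " + a.rowsShipped);
        System.out.println("backend filter:     selected " + viaBackend.size()
                + ", shipped " + b.rowsShipped);
    }
}
```

Both strategies select the same 5 rows, but the first ships all 1000 rows to
the client and the second ships only 5. Whether a given GORA backend can
actually evaluate such a predicate on its side is exactly the open question.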


Julien


On 21 February 2013 16:58, Roland <[email protected]> wrote:

> Hi Julien,
>
> the point I personally don't get is: why is generating fast but fetching
> not?
> If it's possible to filter the GeneratorJob at the backend (which I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland
>
> On 21.02.2013 12:27, Julien Nioche wrote:
>
>> Lewis,
>>
>> The point is whether the filtering is done on the backend side (e.g. using
>> queries, indices, etc.) and the result then passed on to MapReduce via GORA,
>> or, as I assume from looking at the code, done within the MapReduce job,
>> which means that all the entries are pulled from the backend anyway.
>> This makes quite a difference in terms of performance if you think, e.g.,
>> about a large webtable which would have to be entirely passed to MapReduce
>> even if only a handful of entries are to be processed.
>>
>> Makes sense?
>>
>> Julien
>>
>>
>> On 21 February 2013 01:52, Lewis John Mcgibbney
>> <[email protected]> wrote:
>>
>>> Those filters are applied only to URLs which have a null GENERATE_MARK;
>>> URLs which already carry a mark are skipped, e.g.
>>>
>>>      // Skip pages that already carry a generate mark.
>>>      if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>        if (GeneratorJob.LOG.isDebugEnabled()) {
>>>          GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>        }
>>>        return;
>>>      }
>>>
>>> Therefore filters will be applied to all URLs which have a null
>>> GENERATE_MARK value.
>>>
>>> On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Are those filters applied to all data selected from HBase, or sent to HBase
>>>> as filters to select a subset of all HBase records?
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Lewis John Mcgibbney <[email protected]>
>>>> To: user <[email protected]>
>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>> Subject: Re: nutch with cassandra internal network usage
>>>>
>>>>
>>>> Hi Alex,
>>>>
>>>> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>>>>
>>>>> The generator also does not have filters. Its mapper goes over all
>>>>> records as far as I know. If you use hadoop you can see how many records
>>>>> go as input to mappers. Also see this
>>>>>
>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>> selecting them for inclusion based on the following
>>>> - distance
>>>> - URLNormalizer(s)
>>>> - URLFilter(s)
>>>> in that order.
>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>> regarding better configuration as it is a crucial stage in the crawl
>>>> process.
>>>>
>>>> So the issue here, as you correctly explain, is with the Fetcher obtaining
>>>> the URLs which have been marked with a desired batchId. This would be done
>>>> via a scanner in Gora.
>>>>
>>>>
>>>>
>>>>
>>> --
>>> *Lewis*
>>>
>>>
>>
>>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
