Re: nutch with cassandra internal network usage

Julien Nioche Mon, 04 Mar 2013 02:06:21 -0800

Hi Roland,

Can you please open a JIRA for this? Thanks for investigating, the
explanation makes a lot of sense


Julien

On 4 March 2013 07:26, Roland <[email protected]> wrote:

> Hi all,
>
> I've read the sources ;)
> (no, not really all, but enough, I hope)
>
> So, the major difference between generator & fetcher is the set of fields
> that each loads from the db.
> As I had fetcher.store.content=true in the beginning, there was a lot of
> data in the content fields.
> I run with fetcher.parse=true, and that's why it loads all content during
> start-up of fetcherJob.
>
> I did this in my local 2.1 sources:
> Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
> ===================================================================
> --- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
> +++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
> @@ -140,6 +140,8 @@
>      if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
>        ParserJob parserJob = new ParserJob();
>        fields.addAll(parserJob.getFields(job));
> +      fields.remove(WebPage.Field.CONTENT); // FIXME
> +      fields.remove(WebPage.Field.OUTLINKS); // FIXME
>      }
>      ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
>      fields.addAll(protocolFactory.getFields());
>
> and now the start-up time of a fetcherJob is about 10 minutes :)
>
> --Roland
>
>
> On 22.02.2013 10:28, Roland wrote:
>
>> Hi Julien,
>>
>> ok, so thanks for the clarification, I think I have to read the sources :)
>>
>> --Roland
>>
>> On 22.02.2013 10:10, Julien Nioche wrote:
>>
>>> Hi Roland
>>>
>>> My previous email should have started with "The point Alex is making is
>>> ..."
>>> and not just "The point is ...".
>>> I don't have an explanation as to why the generator is faster than the
>>> fetcher, as I don't use 2.x at all, but it would definitely be interesting
>>> to find out. The behaviour of the fetcher is how I expect GORA to behave
>>> in its current form, i.e. pull everything - filter - process.
>>>
>>> Julien
>>>
>>>
>>> On 21 February 2013 16:58, Roland <[email protected]> wrote:
>>>
>>>> Hi Julien,
>>>>
>>>> the point I personally don't get is: why is generating fast, but
>>>> fetching not?
>>>> If it's possible to filter the generatorJob at the backend (which I think
>>>> it does), shouldn't it be possible to do the same for the fetcher?
>>>>
>>>> --Roland
>>>>
>>>> On 21.02.2013 12:27, Julien Nioche wrote:
>>>>
>>>>> Lewis,
>>>>>
>>>>> The point is whether the filtering is done on the backend side (e.g.
>>>>> using queries, indices, etc.) and then passed on to MapReduce via GORA
>>>>> or, as I assume from looking at the code, done within MapReduce, which
>>>>> means that all the entries are pulled from the backend anyway.
>>>>> This makes quite a difference in terms of performance if you think e.g.
>>>>> about a large webtable which would have to be passed to MapReduce in its
>>>>> entirety even if only a handful of entries are to be processed.
>>>>>
>>>>> Makes sense?
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 21 February 2013 01:52, Lewis John Mcgibbney
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Those filters are applied only to URLs which have a null
>>>>>> GENERATE_MARK; URLs that already carry the mark are skipped,
>>>>>> e.g.
>>>>>>
>>>>>>       if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>>>>         if (GeneratorJob.LOG.isDebugEnabled()) {
>>>>>>           GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>>>>         }
>>>>>>         return;
>>>>>>       }
>>>>>>
>>>>>> Therefore filters will be applied to all URLs which have a null
>>>>>> GENERATE_MARK value.
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Are those filters applied to all data selected from hbase, or sent to
>>>>>>> hbase as filters to select a subset of all hbase records?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Alex.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lewis John Mcgibbney <[email protected]>
>>>>>>> To: user <[email protected]>
>>>>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>>>>> Subject: Re: nutch with cassandra internal network usage
>>>>>>>
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>>>>>>>
>>>>>>>> The generator also does not have filters. Its mapper goes over all
>>>>>>>> records as far as I know. If you use hadoop you can see how many
>>>>>>>> records go as input to mappers. Also see this
>>>>>>>
>>>>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>>>>> selecting them for inclusion based on the following
>>>>>>> - distance
>>>>>>> - URLNormalizer(s)
>>>>>>> - URLFilter(s)
>>>>>>> in that order.
>>>>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>>>>> regarding better configuration as it is a crucial stage in the crawl
>>>>>>> process.
>>>>>>>
>>>>>>> So the issue here, as you correctly explain, is with the Fetcher
>>>>>>> obtaining the URLs which have been marked with a desired batchId.
>>>>>>> This would be done via scanner in Gora.
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>
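
For reference, the effect of Roland's patch quoted above can be sketched outside Nutch. This is only an illustration under stated assumptions: the PageField enum, the two field sets and the dropBulkyFields switch are hypothetical stand-ins, not the real WebPage.Field values or the actual FetcherJob logic.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for WebPage.Field; only a few fields are modelled.
enum PageField { BASE_URL, STATUS, CONTENT, OUTLINKS, METADATA, MARKERS }

class FetcherFieldSketch {
    // Assumed field sets, chosen for illustration only.
    static final Set<PageField> FETCHER_FIELDS =
        EnumSet.of(PageField.BASE_URL, PageField.STATUS, PageField.MARKERS);
    static final Set<PageField> PARSER_FIELDS =
        EnumSet.of(PageField.STATUS, PageField.CONTENT, PageField.OUTLINKS, PageField.METADATA);

    // Builds the set of columns the job asks the store to load.
    static Set<PageField> fieldsToLoad(boolean parseWhileFetching, boolean dropBulkyFields) {
        Set<PageField> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parseWhileFetching) {
            fields.addAll(PARSER_FIELDS);
            if (dropBulkyFields) {
                // Mirrors the two fields.remove(...) lines in the patch.
                fields.remove(PageField.CONTENT);
                fields.remove(PageField.OUTLINKS);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println("without patch: " + fieldsToLoad(true, false));
        System.out.println("with patch:    " + fieldsToLoad(true, true));
    }
}
```

The idea is that the requested field set decides how much data is read per row, so leaving out the large content and outlinks columns reduces what is loaded when the job starts.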

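The distinction discussed further down the thread (filtering in the backend versus filtering inside the MapReduce job) can be made concrete with a toy model. This is a sketch under assumptions: Backend, Row, scanAll and scan are invented names for an in-memory list, not the Gora or Cassandra API, and rowsTransferred simply counts rows handed to the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterPlacementSketch {
    // Hypothetical row: a URL plus the batch it was generated for (null if none).
    static final class Row {
        final String url;
        final String batchId;
        Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
    }

    // Toy in-memory "backend" that counts how many rows it hands to the job.
    static final class Backend {
        private final List<Row> rows;
        int rowsTransferred = 0;
        Backend(List<Row> rows) { this.rows = rows; }

        // Full scan: every row is transferred to the caller.
        List<Row> scanAll() {
            rowsTransferred += rows.size();
            return new ArrayList<>(rows);
        }

        // Filtered scan: the predicate is evaluated inside the backend.
        List<Row> scan(Predicate<Row> filter) {
            List<Row> out = new ArrayList<>();
            for (Row r : rows) {
                if (filter.test(r)) out.add(r);
            }
            rowsTransferred += out.size();
            return out;
        }
    }

    // "Pull everything - filter - process": the job filters after the transfer.
    static List<Row> filterInJob(Backend backend, String batchId) {
        List<Row> out = new ArrayList<>();
        for (Row r : backend.scanAll()) {
            if (batchId.equals(r.batchId)) out.add(r);
        }
        return out;
    }

    // The filter is pushed to the backend, so only matching rows are transferred.
    static List<Row> filterInBackend(Backend backend, String batchId) {
        return backend.scan(r -> batchId.equals(r.batchId));
    }

    static List<Row> sampleRows(int total, int inBatch) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            rows.add(new Row("http://example.com/" + i, i < inBatch ? "batch-1" : null));
        }
        return rows;
    }

    public static void main(String[] args) {
        Backend a = new Backend(sampleRows(1000, 5));
        Backend b = new Backend(sampleRows(1000, 5));
        System.out.println(filterInJob(a, "batch-1").size() + " selected, " + a.rowsTransferred + " transferred");
        System.out.println(filterInBackend(b, "batch-1").size() + " selected, " + b.rowsTransferred + " transferred");
    }
}
```

Both strategies select the same rows; they differ only in how many rows leave the backend, which is the cost the thread is concerned with for a large webtable.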

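The selection order Lewis describes for the generator (skip URLs that already carry a generate mark, then distance, then normalizers, then filters) might look roughly like this. It is a simplified sketch, not the GeneratorMapper code: the select method, its parameters, the trim-based "normalizer" and the http-prefix "filter" are placeholders invented for illustration.

```java
import java.util.Optional;

class GeneratorSelectionSketch {
    // Returns the URL to emit, or empty if the entry is skipped.
    // A negative maxDistance means "no distance limit" (an assumption of this sketch).
    static Optional<String> select(String url, String generateMark, int distance, int maxDistance) {
        // Already generated: skip before any filtering happens.
        if (generateMark != null) {
            return Optional.empty();
        }
        // 1. distance check
        if (maxDistance >= 0 && distance > maxDistance) {
            return Optional.empty();
        }
        // 2. normalizer stand-in: real normalizers do far more than trim
        String normalized = url.trim();
        // 3. filter stand-in: accept only http/https URLs
        if (!normalized.startsWith("http://") && !normalized.startsWith("https://")) {
            return Optional.empty();
        }
        return Optional.of(normalized);
    }

    public static void main(String[] args) {
        System.out.println(select(" http://example.com/ ", null, 0, -1));
        System.out.println(select("http://example.com/", "batch-1", 0, -1));
    }
}
```

Note that every check here runs per entry inside the job, which is consistent with the observation in the thread that the mapper still sees all records.
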
-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
