Hi Roland,

My previous email should have started with "The point Alex is making is ..." and not just "The point is ...".

I don't have an explanation as to why the generator is faster than the fetcher, as I don't use 2.x at all, but it would definitely be interesting to find out. The behaviour of the fetcher is what I would expect from GORA in its current form, i.e. pull everything from the backend, filter within the MapReduce job, then process.

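To make the difference concrete, here is a minimal sketch in plain Java. It does not use the Gora or Nutch APIs: `Backend`, `Row`, `filterInJob` and `filterInBackend` are hypothetical names made up for illustration, and the in-memory list only stands in for a real datastore. It simply counts how many rows leave the "backend" under each strategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterPlacementSketch {

    /** A webtable row reduced to the two fields this sketch needs (hypothetical). */
    static final class Row {
        final String url;
        final String batchId; // null when the row is not marked for any batch

        Row(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical stand-in for a datastore; counts the rows handed to the client. */
    static final class Backend {
        final List<Row> rows;
        int rowsTransferred = 0;

        Backend(List<Row> rows) {
            this.rows = rows;
        }

        /** Full scan: every row is sent to the caller. */
        List<Row> scanAll() {
            rowsTransferred += rows.size();
            return new ArrayList<>(rows);
        }

        /** Filtered scan: the predicate runs inside the store, only matches are sent. */
        List<Row> scan(Predicate<Row> filter) {
            List<Row> out = new ArrayList<>();
            for (Row r : rows) {
                if (filter.test(r)) {
                    out.add(r);
                }
            }
            rowsTransferred += out.size();
            return out;
        }
    }

    /** Pull everything, then filter on the job side (pull - filter - process). */
    static List<Row> filterInJob(Backend backend, String batchId) {
        List<Row> selected = new ArrayList<>();
        for (Row r : backend.scanAll()) {
            if (batchId.equals(r.batchId)) {
                selected.add(r);
            }
        }
        return selected;
    }

    /** Push the same condition down to the backend. */
    static List<Row> filterInBackend(Backend backend, String batchId) {
        return backend.scan(r -> batchId.equals(r.batchId));
    }

    /** Builds a table of `total` rows, of which the first `marked` carry batch "b1". */
    static List<Row> sampleTable(int total, int marked) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            rows.add(new Row("http://example.com/" + i, i < marked ? "b1" : null));
        }
        return rows;
    }

    public static void main(String[] args) {
        Backend a = new Backend(sampleTable(1000, 3));
        Backend b = new Backend(sampleTable(1000, 3));
        System.out.println("in job:     " + filterInJob(a, "b1").size()
                + " selected, " + a.rowsTransferred + " transferred");
        System.out.println("in backend: " + filterInBackend(b, "b1").size()
                + " selected, " + b.rowsTransferred + " transferred");
    }
}
```

Both strategies select the same 3 rows, but the first transfers all 1000 rows to the job while the second transfers only the 3 matches. That is the cost difference on a large webtable when only a handful of entries are to be processed.
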
Julien

On 21 February 2013 16:58, Roland <[email protected]> wrote:

> Hi Julien,
>
> the point I personally don't get, is: why is generating fast - fetching
> not.
> If it's possible to filter the generatorJob at the backend (what I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland
>
> On 21.02.2013 12:27, Julien Nioche wrote:
>
>> Lewis,
>>
>> The point is whether the filtering is done on the backend side (e.g. using
>> queries, indices, etc...) then passed on to MapReduce via GORA or as I
>> assume by looking at the code filtered within the MapReduce which means
>> that all the entries are pulled from the backend anyway.
>> This makes quite a difference in terms of performance if you think e.g.
>> about a large webtable which would have to be entirely passed to mapreduce
>> even if only a handful of entries are to be processed.
>>
>> Makes sense?
>>
>> Julien
>>
>>
>> On 21 February 2013 01:52, Lewis John Mcgibbney
>> <[email protected]> wrote:
>>
>>> Those filters are applied only to URLs which have a null
>>> GENERATE_MARK
>>> e.g.
>>>
>>>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>       if (GeneratorJob.LOG.isDebugEnabled()) {
>>>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>       }
>>>       return;
>>>     }
>>>
>>> Therefore filters will be applied to all URLs which have a null
>>> GENERATE_MARK value.
>>>
>>> On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Are those filters put on all data selected from hbase or sent to hbase
>>>> as filters to select a subset of all hbase records?
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>> -----Original Message-----
>>>> From: Lewis John Mcgibbney <[email protected]>
>>>> To: user <[email protected]>
>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>> Subject: Re: nutch with cassandra internal network usage
>>>>
>>>> Hi Alex,
>>>>
>>>> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>>>>
>>>>> The generator also does not have filters. Its mapper goes over all
>>>>> records as far as I know. If you use hadoop you can see how many
>>>>> records go as input to mappers. Also see this
>>>>
>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>> selecting them for inclusion based on the following
>>>> - distance
>>>> - URLNormalizer(s)
>>>> - URLFilter(s)
>>>> in that order.
>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>> regarding better configuration as it is a crucial stage in the crawl
>>>> process.
>>>>
>>>> So the issue here, as you correctly explain, is with the Fetcher
>>>> obtaining the URLs which have been marked with a desired batchId.
>>>> This would be done via scanner in Gora.
>>>>
>>> --
>>> *Lewis*

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

