Roland, I am curious to know exactly what is happening between the fetcherJob initiation and the actual fetch of the first URL. Does the terminal just hang? Can you track some metrics of the job?
On Thursday, February 21, 2013, Roland <[email protected]> wrote:
> Hi Julien,
>
> the point I personally don't get, is: why is generating fast - fetching not.
> If it's possible to filter the generatorJob at the backend (what I think it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland
>
> On 21.02.2013 12:27, Julien Nioche wrote:
>>
>> Lewis,
>>
>> The point is whether the filtering is done on the backend side (e.g. using
>> queries, indices, etc...) then passed on to MapReduce via GORA or as I
>> assume by looking at the code filtered within the MapReduce which means
>> that all the entries are pulled from the backend anyway.
>> This makes quite a difference in terms of performance if you think e.g
>> about a large webtable which would have to be entirely passed to mapreduce
>> even if only a handful of entries are to be processed.
>>
>> Makes sense?
>>
>> Julien
>>
>>
>> On 21 February 2013 01:52, Lewis John Mcgibbney
>> <[email protected]> wrote:
>>
>>> Those filters are applied only to URLs which do not have a null
>>> GENERATE_MARK
>>> e.g.
>>>
>>>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>       if (GeneratorJob.LOG.isDebugEnabled()) {
>>>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>       }
>>>       return;
>>>
>>> Therefore filters will be applied to all URLs which have a null
>>> GENERATE_MARK value.
>>>
>>> On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Are those filters put on all data selected from hbase or sent to hbase as
>>>> filters to select a subset of all hbase records?
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Lewis John Mcgibbney <[email protected]>
>>>> To: user <[email protected]>
>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>> Subject: Re: nutch with cassandra internal network usage
>>>>
>>>>
>>>> Hi Alex,
>>>>
>>>> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>>>>
>>>>> The generator also does not have filters. Its mapper goes over all
>>>>> records as far as I know. If you use hadoop you can see how many records go
>>>>> as input to mappers. Also see this
>>>>>
>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>> selecting them for inclusion based on the following
>>>> - distance
>>>> - URLNormalizer(s)
>>>> - URLFilter(s)
>>>> in that order.
>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>> regarding better configuration as it is a crucial stage in the crawl
>>>> process.
>>>>
>>>> So the issue here, as you correctly explain, is with the Fetcher obtaining
>>>> the URLs which have been marked with a desired batchId. This would be done
>>>> via scanner in Gora.
>>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>
>
--
*Lewis*
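
PS: to make Julien's point in the quoted thread concrete, here is a minimal, self-contained sketch of why the placement of the batchId filter matters. It does not use Nutch, Gora, HBase or Cassandra at all; the class `FilterPlacementSketch`, its methods, the in-memory "webtable" map and the `batch-42` id are all hypothetical and only model how many rows would leave the store in each case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names exist in Nutch or Gora.
public class FilterPlacementSketch {

    /** Outcome of one scan: the selected URLs and how many rows left the store. */
    public static final class ScanResult {
        public final List<String> urls;
        public final int rowsTransferred;

        public ScanResult(List<String> urls, int rowsTransferred) {
            this.urls = urls;
            this.rowsTransferred = rowsTransferred;
        }
    }

    /** Toy webtable: URL -> batchId mark (null means "not in any batch"). */
    public static Map<String, String> sampleTable() {
        Map<String, String> table = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            // Only every 200th row carries the batch id we want to fetch.
            table.put("http://example.com/page" + i, (i % 200 == 0) ? "batch-42" : null);
        }
        return table;
    }

    /** Filtering inside the mapper: every row is shipped, then most are dropped. */
    public static ScanResult filterInMapper(Map<String, String> table, String batchId) {
        List<String> urls = new ArrayList<>();
        int transferred = 0;
        for (Map.Entry<String, String> row : table.entrySet()) {
            transferred++; // the row has already crossed the network here
            if (batchId.equals(row.getValue())) {
                urls.add(row.getKey());
            }
        }
        return new ScanResult(urls, transferred);
    }

    /** Filtering at the backend: the store checks each row and ships only matches. */
    public static ScanResult filterAtBackend(Map<String, String> table, String batchId) {
        List<String> urls = new ArrayList<>();
        int transferred = 0;
        for (Map.Entry<String, String> row : table.entrySet()) {
            if (!batchId.equals(row.getValue())) {
                continue; // rejected store-side, never shipped
            }
            transferred++;
            urls.add(row.getKey());
        }
        return new ScanResult(urls, transferred);
    }

    public static void main(String[] args) {
        Map<String, String> table = sampleTable();
        ScanResult inMapper = filterInMapper(table, "batch-42");
        ScanResult atBackend = filterAtBackend(table, "batch-42");
        System.out.println("mapper-side filter:  " + inMapper.urls.size()
                + " urls, " + inMapper.rowsTransferred + " rows transferred");
        System.out.println("backend-side filter: " + atBackend.urls.size()
                + " urls, " + atBackend.rowsTransferred + " rows transferred");
    }
}
```

With this toy table both variants select the same 5 URLs, but the mapper-side variant moves all 1000 rows while the backend-side variant moves only 5. Whether the real FetcherJob behaves like the first or the second variant is exactly the open question in this thread, so please treat the sketch as an illustration of the concern and not as a description of the actual code.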

