Hi Roland,

Can you please open a JIRA for this? Thanks for investigating, the explanation makes a lot of sense.

Julien

On 4 March 2013 07:26, Roland <[email protected]> wrote:

> Hi all,
>
> I've read the sources ;)
> (no, not really all, but enough, I hope)
>
> So, the major difference between the generator and the fetcher is the set
> of fields each one loads from the db.
> As I had fetcher.store.content=true in the beginning, there was a lot of
> data in the content fields.
> I run with fetcher.parse=true, and that's why it loads all content during
> start-up of the FetcherJob.
>
> I did this in my local 2.1 sources:
> Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
> ===================================================================
> --- src/java/org/apache/nutch/fetcher/FetcherJob.java (revision 1448112)
> +++ src/java/org/apache/nutch/fetcher/FetcherJob.java (working copy)
> @@ -140,6 +140,8 @@
>      if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
>        ParserJob parserJob = new ParserJob();
>        fields.addAll(parserJob.getFields(job));
> +      fields.remove(WebPage.Field.CONTENT); // FIXME
> +      fields.remove(WebPage.Field.OUTLINKS); // FIXME
>      }
>      ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
>      fields.addAll(protocolFactory.getFields());
>
> and now the start-up time of a FetcherJob is about 10 minutes :)
>
> --Roland
>
>
> On 22.02.2013 10:28, Roland wrote:
>
>> Hi Julien,
>>
>> ok, so thanks for the clarification, I think I have to read the sources :)
>>
>> --Roland
>>
>> On 22.02.2013 10:10, Julien Nioche wrote:
>>
>>> Hi Roland
>>>
>>> My previous email should have started with "The point Alex is making is
>>> ..." and not just "The point is ...".
>>> I don't have an explanation as to why the generator is faster than the
>>> fetcher, as I don't use 2.x at all, but it would definitely be interesting
>>> to find out. The behaviour of the fetcher is how I expect GORA to behave
>>> in its current form, i.e. pull everything - filter - process.
>>>
>>> Julien
>>>
>>>
>>> On 21 February 2013 16:58, Roland <[email protected]> wrote:
>>>
>>>> Hi Julien,
>>>>
>>>> the point I personally don't get is: why is generating fast and fetching
>>>> not?
>>>> If it's possible to filter the GeneratorJob at the backend (which I think
>>>> it does), shouldn't it be possible to do the same for the fetcher?
>>>>
>>>> --Roland
>>>>
>>>> On 21.02.2013 12:27, Julien Nioche wrote:
>>>>
>>>>> Lewis,
>>>>>
>>>>> The point is whether the filtering is done on the backend side (e.g.
>>>>> using queries, indices, etc.) and then passed on to MapReduce via GORA
>>>>> or, as I assume from looking at the code, done within MapReduce, which
>>>>> means that all the entries are pulled from the backend anyway.
>>>>> This makes quite a difference in terms of performance if you think, e.g.,
>>>>> about a large webtable which would have to be passed entirely to
>>>>> MapReduce even if only a handful of entries are to be processed.
>>>>>
>>>>> Makes sense?
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 21 February 2013 01:52, Lewis John Mcgibbney <[email protected]> wrote:
>>>>>
>>>>>> Those filters are applied only to URLs which have a null
>>>>>> GENERATE_MARK; URLs that already carry one are skipped first,
>>>>>> e.g.
>>>>>>
>>>>>> if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>>>>   if (GeneratorJob.LOG.isDebugEnabled()) {
>>>>>>     GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>>>>   }
>>>>>>   return;
>>>>>> }
>>>>>>
>>>>>> Therefore filters will be applied to all URLs which have a null
>>>>>> GENERATE_MARK value.
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 2:45 PM, <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Are those filters put on all data selected from hbase or sent to
>>>>>>> hbase as filters to select a subset of all hbase records?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Alex.
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lewis John Mcgibbney <[email protected]>
>>>>>>> To: user <[email protected]>
>>>>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>>>>> Subject: Re: nutch with cassandra internal network usage
>>>>>>>
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>>>>>>>
>>>>>>>> The generator also does not have filters. Its mapper goes over all
>>>>>>>> records as far as I know. If you use hadoop you can see how many
>>>>>>>> records go as input to mappers. Also see this
>>>>>>>
>>>>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>>>>> selecting them for inclusion based on the following
>>>>>>> - distance
>>>>>>> - URLNormalizer(s)
>>>>>>> - URLFilter(s)
>>>>>>> in that order.
>>>>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>>>>> regarding better configuration as it is a crucial stage in the crawl
>>>>>>> process.
>>>>>>>
>>>>>>> So the issue here, as you correctly explain, is with the Fetcher
>>>>>>> obtaining the URLs which have been marked with a desired batchId.
>>>>>>> This would be done via scanner in Gora.
>>>>>>>
>>>>>> --
>>>>>> *Lewis*

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
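
Roland's patch in the thread above comes down to which columns the fetcher asks the store to load at start-up. The following is a minimal sketch in plain Java, not Nutch code: the `Field` enum and the methods `fetcherOwnFields`, `parserFields`, and `fieldsToLoad` are hypothetical stand-ins for `WebPage.Field` and the `getFields` calls in the diff, and the column lists are assumptions made for illustration only.

```java
import java.util.EnumSet;
import java.util.Set;

public class FieldSelectionSketch {

    // Hypothetical stand-in for WebPage.Field; only a few columns are modelled.
    enum Field { BASE_URL, STATUS, MARKERS, CONTENT, OUTLINKS, METADATA }

    // Columns the fetcher is assumed to need for its own work.
    static Set<Field> fetcherOwnFields() {
        return EnumSet.of(Field.BASE_URL, Field.STATUS, Field.MARKERS);
    }

    // Columns a parser is assumed to request, including the two heavy ones.
    static Set<Field> parserFields() {
        return EnumSet.of(Field.STATUS, Field.CONTENT, Field.OUTLINKS, Field.METADATA);
    }

    // Builds the set of columns to load, mirroring the shape of the patched code:
    // when parsing is enabled the parser's fields are added, and the workaround
    // then removes CONTENT and OUTLINKS again so they are not read at start-up.
    static Set<Field> fieldsToLoad(boolean parse, boolean dropHeavyFields) {
        Set<Field> fields = fetcherOwnFields();
        if (parse) {
            fields.addAll(parserFields());
            if (dropHeavyFields) {
                fields.remove(Field.CONTENT);
                fields.remove(Field.OUTLINKS);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Without the workaround: the heavy columns are part of the load set.
        System.out.println("parse=true, unpatched: " + fieldsToLoad(true, false));
        // With the workaround: the heavy columns are left out.
        System.out.println("parse=true, patched:   " + fieldsToLoad(true, true));
    }
}
```

The sketch only shows the set arithmetic. Whether it is safe to skip these columns in a real crawl depends on whether anything downstream reads the previously stored values, which is presumably why the lines in the patch are marked FIXME.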

