Hi Markus,

I had mentioned in my email that I had already tried that parameter, but it didn't work. Is that the only way to achieve this? Can I add code to a plugin somewhere to do this?

Thanks,
Safdar

On Aug 29, 2012 11:18 PM, "Markus Jelsma" <[email protected]> wrote:

> Hi
>
> Check the db.parsemeta.to.crawldb parameter. It'll send your parse meta
> keys to the CrawlDatum meta data.
>
> Cheers
>
> -----Original message-----
> > From: Safdar Kureishy <[email protected]>
> > Sent: Wed 29-Aug-2012 21:26
> > To: [email protected]
> > Subject: Need to transfer Parse metadata obtained in
> > HtmlParseFilter.filter() to the CrawlDb
> >
> > Hi,
> >
> > I've built a custom HtmlParseFilter and am doing custom language
> > identification in the filter() API. Here, I am able to set the relevant
> > lang id properties on a ParseResult object via
> > getParseMeta().put("LangId", id). I am also able to retrieve these
> > properties in my custom ScoringFilter, for use during
> > distributeScoreToOutlinks(). However, what I also need is to persist
> > this data as metadata in the relevant CrawlDb record (i.e., in the
> > CrawlDatum.getMetadata() data structure). My intent, from all this, is
> > finally to be able to write custom Hadoop jobs to gather language
> > distribution statistics directly from the CrawlDb (without having to do
> > any joins on the ParseText, Content, ParseData types). The only way I
> > see this being possible is if each URL's CrawlDatum also has the
> > lang-id in its metadata.
> >
> > This is turning out to be a challenge. I first tried transferring the
> > parse properties in my custom ScoringFilter.distributeScoreToOutlinks()
> > API, because that API offers access to the ParseResult as well as an
> > "adjust" CrawlDatum parameter for updating the CrawlDb (according to
> > the Javadocs). However, doing that is not updating the CrawlDb. Then,
> > in the newsgroup archives, I stumbled upon a thread about the
> > "db.parsemeta.to.crawldb" property being used by the ParseOutputFormat
> > class to do exactly the same property transfer at a different stage of
> > the crawl cycle, but that doesn't work either.
> >
> > So, I'm writing to the newsgroup hoping someone could give me specific
> > advice on which API I should override, or which configuration setting I
> > should change, so as to transfer custom parse-time metadata to the
> > CrawlDb.
> >
> > Thanks in advance.
> >
> > Cheers,
> > Safdar
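
For anyone following the thread, the kind of transfer being discussed (copying selected parse-metadata keys such as "LangId" into a CrawlDb record's metadata) can be sketched roughly as below. This is only an illustration under stated assumptions, not Nutch's actual code: it uses plain Java Maps as stand-ins for the parse metadata and for CrawlDatum.getMetadata(), the class name ParseMetaTransfer and the method transfer() are hypothetical, and treating the configured value as a comma-separated list of keys is an assumption about how db.parsemeta.to.crawldb is meant to be set.

```java
import java.util.Map;

/**
 * Stand-alone sketch (plain Maps, no Nutch classes) of copying selected
 * parse-metadata keys into a CrawlDb record's metadata.
 * ParseMetaTransfer and transfer() are hypothetical names for illustration.
 */
public class ParseMetaTransfer {

    /**
     * Copies every key named in commaSeparatedKeys from parseMeta to datumMeta.
     * Keys that are absent from parseMeta are skipped; keys that are not
     * listed are never copied. Returns the number of keys actually copied.
     */
    public static int transfer(String commaSeparatedKeys,
                               Map<String, String> parseMeta,
                               Map<String, String> datumMeta) {
        if (commaSeparatedKeys == null) {
            return 0; // nothing configured, nothing to transfer
        }
        int copied = 0;
        for (String raw : commaSeparatedKeys.split(",")) {
            String key = raw.trim(); // tolerate spaces around the commas
            if (key.isEmpty()) {
                continue;
            }
            String value = parseMeta.get(key);
            if (value != null) {
                datumMeta.put(key, value);
                copied++;
            }
        }
        return copied;
    }
}
```

Calling ParseMetaTransfer.transfer("LangId", parseMeta, datumMeta) with a parseMeta map that contains "LangId" leaves that one entry in datumMeta and ignores everything else. A comparison is case-sensitive here, so the configured key would have to match the key used in getParseMeta().put("LangId", id) exactly.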

