Hi Markus,

I mentioned in my email that I had already tried that parameter, but it
didn't work. Is that the only way to achieve this? Can I add code in some
plugin for this somewhere?

Thanks,
Safdar
On Aug 29, 2012 11:18 PM, "Markus Jelsma" <[email protected]>
wrote:

> Hi
>
> Check the db.parsemeta.to.crawldb parameter. It'll send your parse meta
> keys to the CrawlDatum meta data.
>
> Cheers
>
>
>
> -----Original message-----
> > From:Safdar Kureishy <[email protected]>
> > Sent: Wed 29-Aug-2012 21:26
> > To: [email protected]
> > Subject: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() to the CrawlDb
> >
> > Hi,
> >
> > I've built a custom HtmlParseFilter and am doing custom language
> > identification in the filter() API. Here, I am able to set the relevant
> > lang id properties on a ParseResult object via
> > getParseMeta().put("LangId", id). I am also able to retrieve these
> > properties in my custom ScoringFilter, for use during
> > distributeScoreToOutlinks(). However, what I also need is to persist
> > this data as metadata in the relevant CrawlDb record (i.e., in the
> > CrawlDatum.getMetadata() data structure). My intent, from all this, is
> > finally to be able to write custom Hadoop jobs to gather language
> > distribution statistics directly from the CrawlDb (without having to do
> > any joins on the ParseText, Content, ParseData types). The only way I
> > see this being possible is if each URL's CrawlDatum also has the
> > lang-id in its metadata.
> >
> > This is turning out to be a challenge. I first tried transferring the
> > parse properties in my custom ScoringFilter.distributeScoreToOutlinks()
> > API, because that API offers access to the ParseResult as well as an
> > "adjust" CrawlDatum parameter for updating the CrawlDb (according to
> > the Javadocs). However, doing that is not updating the CrawlDb. Then,
> > in the newsgroup archives, I stumbled upon a thread about the
> > "db.parsemeta.to.crawldb" property being used by the ParseOutputFormat
> > class to do exactly the same property transfer at a different stage of
> > the crawl cycle, but that doesn't work either.
> >
> > So, I'm writing to the newsgroup hoping someone could give me specific
> > advice on which API I should override, or which configuration setting I
> > should change, so as to transfer custom parse-time metadata to the
> > CrawlDb.
> >
> > Thanks in advance.
> >
> > Cheers,
> > Safdar
> >
>
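
For reference, the db.parsemeta.to.crawldb setting discussed in this thread
is normally overridden in conf/nutch-site.xml. A minimal sketch, assuming the
parse metadata key is exactly "LangId" as in the original message (the value
is a comma-separated list of keys, and the key name is likely case-sensitive).
Whether it takes effect may depend on the Nutch version and on the parse step
writing its output through ParseOutputFormat, followed by an updatedb run:

```xml
<!-- conf/nutch-site.xml (sketch; "LangId" is the key from the message above) -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- Comma-separated parse metadata keys to copy into CrawlDatum metadata -->
  <value>LangId</value>
</property>
```

If the key still does not show up in the CrawlDb, dumping a few records with
the readdb tool after the next updatedb is one way to confirm whether the
metadata was carried over.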
