Re: Getting seed url

remi tassing Mon, 11 Jun 2012 15:46:15 -0700

Segments have a field called 'outlinks'. Could this help?

On Tuesday, June 12, 2012, Sebastian Nagel wrote:


> Hi Sandeep,
>
> tracking the seed(s) for a document could be done by a scoring filter.
> The seed URL must be passed:
>  0  into CrawlDatum's meta by injectedScore()
>    (alternatively, use additional fields in the seed file:
>      <url> <tab> seed=<url>
>     see Injector Javadoc)
>  1  in passScoreBeforeParsing():
>    from CrawlDatum to Content
>  2  in passScoreAfterParsing():
>    from Content to ParseData
>  3  in distributeScoreToOutlinks():
>    from source ParseData to all target/outlink CrawlDatum objects
>  4  in updateDbScore():
>    resolve inlinks from multiple seeds
>
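To make the hand-off in steps 0 to 3 concrete, here is a rough sketch in plain Java. It deliberately does not use the Nutch classes: ordinary maps stand in for the metadata of CrawlDatum, Content and ParseData, and the class name, method names and the "seed" key are made up for illustration. A real scoring filter would do the equivalent copying inside the methods named above.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the seed hand-off; plain maps replace Nutch metadata objects.
class SeedPropagationSketch {
    // Assumed metadata key, matching the "seed=<url>" field in the seed file.
    static final String SEED_KEY = "seed";

    // Step 0: split a seed-file line "<url> <tab> key=value ..." into URL and metadata.
    static Map.Entry<String, Map<String, String>> parseSeedLine(String line) {
        String[] fields = line.split("\t");
        Map<String, String> meta = new HashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) {
                meta.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return new AbstractMap.SimpleEntry<>(fields[0], meta);
    }

    // Steps 1 and 2: copy the seed from one metadata container to the next.
    static void copySeed(Map<String, String> from, Map<String, String> to) {
        String seed = from.get(SEED_KEY);
        if (seed != null) {
            to.put(SEED_KEY, seed);
        }
    }

    // Step 3: give every outlink target its own metadata carrying the source's seed.
    static Map<String, Map<String, String>> distributeToOutlinks(
            Map<String, String> parseMeta, List<String> outlinks) {
        Map<String, Map<String, String>> targets = new LinkedHashMap<>();
        for (String url : outlinks) {
            Map<String, String> targetMeta = new HashMap<>();
            copySeed(parseMeta, targetMeta);
            targets.put(url, targetMeta);
        }
        return targets;
    }

    public static void main(String[] args) {
        Map.Entry<String, Map<String, String>> injected =
                parseSeedLine("http://apache.org\tseed=http://apache.org");
        Map<String, String> contentMeta = new HashMap<>();
        copySeed(injected.getValue(), contentMeta);   // CrawlDatum -> Content
        Map<String, String> parseMeta = new HashMap<>();
        copySeed(contentMeta, parseMeta);             // Content -> ParseData
        System.out.println(
                distributeToOutlinks(parseMeta, Arrays.asList("http://nutch.apache.org")));
    }
}
```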
> Point 4 shows a little problem: a page may be reachable from multiple
> seeds.
> The web is a graph, not a forest of trees each with one seed as root!
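One way to deal with the multiple-seed case in point 4 is to keep a set of seeds per page and take the union over everything the inlinks report. A minimal sketch, again in plain Java rather than against the Nutch API (the class and method names are hypothetical, and how the set would be serialized into the metadata is left open):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: resolve seeds for a page that is reachable from several seeds.
class SeedMergeSketch {
    // Union of the seeds already recorded for the page and those reported by each inlink.
    // A sorted set keeps the result deterministic regardless of inlink order.
    static SortedSet<String> mergeSeeds(Set<String> known, List<Set<String>> fromInlinks) {
        SortedSet<String> merged = new TreeSet<>(known);
        for (Set<String> seeds : fromInlinks) {
            merged.addAll(seeds);
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> known = Collections.singleton("http://apache.org");
        List<Set<String>> inlinks = Arrays.asList(
                new HashSet<>(Arrays.asList("http://apache.org")),
                new HashSet<>(Arrays.asList("http://w3.org")));
        System.out.println(mergeSeeds(known, inlinks));
    }
}
```

With a set per page, the plugin then has to decide what to do when the seeds map to different metadata values, which is the real cost of the graph structure.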
>
> Finally: amazon.com is definitely linked from apache.org
> but it is not a "project" site.
> Wouldn't a mapping <domain name> -> <meta data> be more reliable
> (though notoriously incomplete)?
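The domain mapping could look roughly like this. It is only a sketch: the class and method names are invented, the map would presumably be loaded from Sandeep's properties file, and matching on host suffixes is my assumption about how "domain" should be interpreted (so that nutch.apache.org falls under apache.org).

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup of metadata by domain name instead of by seed URL.
class DomainMetadataSketch {
    // Try the full host first, then strip leading labels one at a time
    // (nutch.apache.org -> apache.org -> org) until a mapping is found.
    static Optional<String> lookup(Map<String, String> byDomain, String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            String value = byDomain.get(host);
            if (value != null) {
                return Optional.of(value);
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return Optional.empty();
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        Map<String, String> byDomain = new HashMap<>();
        byDomain.put("apache.org", "project");
        byDomain.put("amazon.com", "estore");
        System.out.println(lookup(byDomain, "http://nutch.apache.org/"));
        System.out.println(lookup(byDomain, "http://example.com/"));
    }
}
```

Since the lookup needs only the URL of the page being parsed, it would work from inside a parse filter without carrying anything through the crawl, at the price of missing any domain that is not listed.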
>
> Best,
> Sebastian
>
> On 06/11/2012 08:09 PM, Sandeep C R wrote:
> > Hello,
> >
> > I am trying to find a way to get the seed URL of the URL currently
> > being parsed. I have many URLs in seed.txt, and I am trying to add
> > additional metadata for each URL crawled. The metadata depends on the
> > seed URL of the current URL and will later be picked up by the indexer.
> > I have written a custom plugin for this purpose. However, I am unable
> > to get the seed URL of the URL currently being parsed.
> >
> > Ex: This is my seed.txt
> >
> > http://apache.org
> > http://amazon.com
> > http://w3.org
> >
> > For all URLs crawled from each seed URL, I want to add metadata. The
> > value of the metadata will depend on the seed URL. I have a properties
> > file which maps each seed URL to a metadata value. If the seed URL is
> > http://apache.org then my metadata will be something like "project".
> > If it is http://amazon.com then it will be "estore". I have written a
> > plugin which adds the metadata. This plugin extends HtmlParserFilter.
> > However, I am not able to find a way to get the seed URL of the current
> > URL. If http://nutch.apache.org is currently being parsed, how do we
> > know the seed URL (http://apache.org) of this URL? Is there any API
> > which I could use in my plugin? Or is there a better way to achieve
> > this?
> >
> > Regards,
> > Sandeep
> >
>
>
