Re: Getting seed url

Julien Nioche Tue, 12 Jun 2012 06:43:11 -0700

Forgot to say: this would work by adding a "seed" metadata entry to the URLs
in the seed list; its value is then propagated to the outlinks by the scoring
filter in urlmeta.
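
A rough sketch of what that setup might look like, for illustration only.
The tag name "seed" is an assumption (any key works as long as the seed list
and urlmeta.tags agree), and the property should be checked against the
description in nutch-default.xml for your version:

```xml
<!-- nutch-site.xml: sketch only. The tag name "seed" is an assumption; it must
     match the key used in the seed list, e.g. a seed line of the form
       http://apache.org <tab> seed=http://apache.org
     where <tab> stands for a literal tab character. -->
<property>
  <name>urlmeta.tags</name>
  <value>seed</value>
</property>
```

The urlmeta plugin also has to be enabled in plugin.includes for this to take
effect. Note that this does not settle the multi-seed case raised below: a
page reachable from several seeds still carries only one value per tag.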


On 12 June 2012 14:41, Julien Nioche <[email protected]> wrote:

> That's the idea indeed. The urlmeta plugin lets you do that simply by
> setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for the
> description, etc.)
>
>
>
> On 11 June 2012 22:45, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Sandeep,
>>
>> tracking the seed(s) for a document could be done by a scoring filter.
>> The seed URL must be passed:
>>  0  into CrawlDatum's meta by injectedScore()
>>    (alternatively, use additional fields in the seed file:
>>      <url> <tab> seed=<url>
>>     see Injector Javadoc)
>>  1  in passScoreBeforeParsing():
>>    from CrawlDatum to Content
>>  2  in passScoreAfterParsing():
>>    from Content to ParseData
>>  3  in distributeScoreToOutlinks():
>>    from source ParseData to all target/outlink CrawlDatum objects
>>  4  in updateDbScore():
>>    resolve inlinks from multiple seeds
>>
>> Point 4 shows a little problem: a page may be reachable from multiple
>> seeds.
>> The web is a graph not a forest of trees each with one seed as root!
>>
>> Finally: amazon.com is definitely linked from apache.org
>> but it is not a "project" site.
>> Wouldn't a mapping <domain name> -> <meta data> be more reliable
>> (though notoriously incomplete)?
>>
>> Best,
>> Sebastian
>>
>> On 06/11/2012 08:09 PM, Sandeep C R wrote:
>> > Hello,
>> >
>> > I am trying to find a way to get the seed URL of the URL currently
>> > being parsed. I have many URLs in seed.txt. I am trying to add additional
>> > metadata for each URL crawled. The metadata depends on the seed URL of the
>> > current URL. This metadata will later be picked up by the indexer. I have
>> > written a custom plugin for this purpose. However, I am unable to get the
>> > seed URL of the URL currently being parsed.
>> >
>> > Ex: This is my seed.txt
>> >
>> > http://apache.org
>> > http://amazon.com
>> > http://w3.org
>> >
>> > For all URLs crawled from each seed URL, I want to add metadata. The value
>> > of the metadata will depend on the seed URL. I have a properties file which
>> > maps seed URL to metadata value. If the seed URL is http://apache.org then
>> > my metadata will be something like "project". If it is http://amazon.com
>> > then it will be "estore". I have written a plugin which adds the metadata.
>> > This plugin extends HtmlParserFilter. However, I am not able to find a way
>> > to get the seed URL of the current URL. If http://nutch.apache.org is being
>> > parsed currently, how do we know the seed URL (http://apache.org) of this
>> > URL? Is there any API which I could use in my plugin? Or is there a better
>> > way to achieve this?
>> >
>> > Regards,
>> > Sandeep
>> >
>>
>>
>
>


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
