Re: Getting seed url

Sebastian Nagel Tue, 12 Jun 2012 11:53:38 -0700

Thanks Julien,

I had missed that urlmeta passes the tags on to the outlinks.
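
For the archive, here is a minimal sketch of that propagation as a toy model in plain Java. It does not use the Nutch API: the class SeedTagSketch and its methods are hypothetical names, and it assumes seed lines of the form <url> <tab> seed=<url>. Keeping every seed that reaches a page is only one assumed way to handle pages linked from several seeds, not necessarily what urlmeta does.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of seed tracking. Hypothetical names, no Nutch classes involved. */
final class SeedTagSketch {

    /** Parses one seed line: url, then tab-separated key=value metadata fields. */
    static Map.Entry<String, Map<String, String>> parseSeedLine(String line) {
        String[] fields = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) {
                meta.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return new AbstractMap.SimpleEntry<>(fields[0], meta);
    }

    /**
     * Pushes each seed's tag along the outlinks until nothing changes.
     * A page reachable from several seeds keeps all of them (assumed policy).
     */
    static Map<String, Set<String>> propagate(Map<String, String> seedOf,
                                              Map<String, List<String>> outlinks) {
        Map<String, Set<String>> seedsByUrl = new TreeMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (Map.Entry<String, String> e : seedOf.entrySet()) {
            seedsByUrl.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).add(e.getValue());
            queue.add(e.getKey());
        }
        while (!queue.isEmpty()) {
            String url = queue.poll();
            // Copy the tags so a self-link never iterates the set it modifies.
            List<String> tags = new ArrayList<>(seedsByUrl.get(url));
            for (String target : outlinks.getOrDefault(url, Collections.emptyList())) {
                Set<String> targetTags =
                        seedsByUrl.computeIfAbsent(target, k -> new TreeSet<>());
                if (targetTags.addAll(tags)) {
                    queue.add(target); // revisit only when the target gained a tag
                }
            }
        }
        return seedsByUrl;
    }

    public static void main(String[] args) {
        Map<String, String> seedOf = new HashMap<>();
        for (String line : Arrays.asList(
                "http://apache.org\tseed=http://apache.org",
                "http://amazon.com\tseed=http://amazon.com")) {
            Map.Entry<String, Map<String, String>> entry = parseSeedLine(line);
            seedOf.put(entry.getKey(), entry.getValue().get("seed"));
        }
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("http://apache.org",
                Arrays.asList("http://nutch.apache.org", "http://amazon.com"));
        System.out.println(propagate(seedOf, outlinks));
    }
}
```

In this toy run http://nutch.apache.org ends up tagged with the apache.org seed only, while http://amazon.com carries both seeds, which is exactly the multiple-seed case from point 4 of my earlier mail.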


Sebastian

On 06/12/2012 03:42 PM, Julien Nioche wrote:
> Forgot to say: this would work by adding a seed metadata field to the URLs in
> the seed list; its value is then propagated by the scoring filter
> in urlmeta.
> 
> On 12 June 2012 14:41, Julien Nioche <[email protected]> wrote:
> 
>> That's the idea indeed. The urlmeta plugin allows you to do that simply by
>> setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for the
>> description etc.).
>>
>>
>>
>> On 11 June 2012 22:45, Sebastian Nagel <[email protected]> wrote:
>>
>>> Hi Sandeep,
>>>
>>> tracking the seed(s) for a document could be done by a scoring filter.
>>> The seed URL must be passed:
>>>  0  into CrawlDatum's meta by injectedScore()
>>>    (alternatively, use additional fields in the seed file:
>>>      <url> <tab> seed=<url>
>>>     see Injector Javadoc)
>>>  1  in passScoreBeforeParsing():
>>>    from CrawlDatum to Content
>>>  2  in passScoreAfterParsing():
>>>    from Content to ParseData
>>>  3  in distributeScoreToOutlinks():
>>>    from source ParseData to all target/outlink CrawlDatum objects
>>>  4  in updateDbScore():
>>>    resolve inlinks from multiple seeds
>>>
>>> Point 4 reveals a small problem: a page may be reachable from multiple
>>> seeds.
>>> The web is a graph, not a forest of trees each with one seed as its root!
>>>
>>> Finally: amazon.com is definitely linked from apache.org
>>> but it is not a "project" site.
>>> Wouldn't a mapping <domain name> -> <meta data> be more reliable
>>> (though notoriously incomplete)?
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 06/11/2012 08:09 PM, Sandeep C R wrote:
>>>> Hello,
>>>>
>>>> I am trying to find a way to get the seed URL of the URL currently
>>>> being parsed. I have many URLs in seed.txt, and I am trying to add
>>>> additional metadata for each URL crawled. The metadata depends on the
>>>> seed URL of the current URL. This metadata will later be picked up by the
>>>> indexer. I have written a custom plugin for this purpose. However, I am
>>>> unable to get the seed URL of the URL currently being parsed.
>>>>
>>>> Ex: This is my seed.txt
>>>>
>>>> http://apache.org
>>>> http://amazon.com
>>>> http://w3.org
>>>>
>>>> For every URL crawled from each seed URL, I want to add metadata. The
>>>> value of the metadata will depend on the seed URL. I have a properties
>>>> file which maps each seed URL to a metadata value. If the seed URL is
>>>> http://apache.org, then my metadata will be something like "project". If
>>>> it is http://amazon.com, then it will be "estore". I have written a plugin
>>>> which will add the metadata. This plugin implements HtmlParseFilter.
>>>> However, I am not able to find a way to get the seed URL of the current
>>>> URL. If http://nutch.apache.org is being parsed currently, then how do we
>>>> know the seed URL (http://apache.org) of this URL? Is there any API which
>>>> I could use in my plugin? Or is there a better way to achieve this?
>>>>
>>>> Regards,
>>>> Sandeep
>>>>
>>>
>>>
>>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>>
> 
>
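
The alternative raised earlier in the thread, a mapping <domain name> -> <meta data>, could be sketched roughly as follows. This is plain Java under stated assumptions, not Nutch code: the class DomainLabelSketch and its methods are hypothetical, the labels "project" and "estore" are taken from the original question, and matching on the longest registered host suffix is an assumed policy.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: looks up a metadata label by the host of a URL. */
final class DomainLabelSketch {

    private final Map<String, String> labelByDomain = new HashMap<>();

    /** Registers a label for a domain such as "apache.org". */
    void register(String domain, String label) {
        labelByDomain.put(domain.toLowerCase(Locale.ROOT), label);
    }

    /**
     * Returns the label of the longest registered domain that the URL's host
     * ends with, trying "nutch.apache.org" before "apache.org" before "org".
     */
    Optional<String> labelFor(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a syntactically valid URI
        }
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            String label = labelByDomain.get(host);
            if (label != null) {
                return Optional.of(label);
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return Optional.empty(); // no registered domain matched
            }
            host = host.substring(dot + 1); // drop the leftmost host label
        }
    }

    public static void main(String[] args) {
        DomainLabelSketch labels = new DomainLabelSketch();
        labels.register("apache.org", "project");
        labels.register("amazon.com", "estore");
        System.out.println(labels.labelFor("http://nutch.apache.org/about.html"));
        System.out.println(labels.labelFor("http://w3.org"));
    }
}
```

Unlike seed tracking, this lookup does not depend on how a page was reached, so a link from apache.org to amazon.com cannot mislabel the target. The cost is the one already noted in the thread: the mapping is only as complete as the list of registered domains.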
