Re: anchor text in crawldb/Generator

Nobin Mathew Fri, 24 Dec 2010 08:18:30 -0800

Hi,

On Fri, Dec 24, 2010 at 6:39 PM, Markus Jelsma
<[email protected]> wrote:
> You can find the anchor text in the LinkDB.


Thanks, but how would we access the LinkDb from CrawlDbFilter? I want to
have the filter in CrawlDbFilter because only there can I reduce the
amount of fetching required. I would fetch only the pages whose anchor
text matches a specific pattern; if the anchor text does not match, the
page is not fetched (just like a urlfilter, but using the anchor instead
of the URL).
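
To make the idea concrete, here is a minimal sketch of the check I have in
mind. It assumes the anchor texts pointing at a URL are somehow available
as a list of strings, which is exactly the part I do not know how to get
inside CrawlDbFilter. The class AnchorPatternFilter and its method are
hypothetical names for illustration only, not part of Nutch.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: decides whether a URL should be
// kept, based on the anchor texts of the links pointing at it.
class AnchorPatternFilter {
    private final Pattern pattern;

    AnchorPatternFilter(String regex) {
        // Case-insensitive so "Search Engine" and "search engine" both match.
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Returns the URL if at least one anchor contains a match for the
    // pattern, otherwise null (the same convention a urlfilter uses to
    // reject a URL).
    String filter(String url, List<String> anchors) {
        if (anchors == null) {
            return null; // no anchor information: reject in this sketch
        }
        for (String anchor : anchors) {
            if (anchor != null && pattern.matcher(anchor).find()) {
                return url;
            }
        }
        return null;
    }
}
```

With a filter built as new AnchorPatternFilter("search engine"), a URL whose
inlink anchors include "Open source Search Engine" would be kept, while one
linked only as "Contact us" would be dropped before fetching. Rejecting URLs
that have no anchors at all is just a choice made for this sketch.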

Correct me if I am wrong; I don't know whether I am missing something
basic in the Nutch architecture.

>
> On Friday 24 December 2010 14:00:45 Nobin Mathew wrote:
>> Hi,
>>
>> I am Nobin, and I am working on a search engine based on nutch.
>>
>> I have some questions regarding nutch, and will be very helpful for me
>> if somebody can answer.
>>
>> I am working on a plugin (an anchor-based url filter) where I need to
>> have anchor text in CrawlDbFilter (nutch 1.2), but after going through
>> the source, it seems getting the anchor in CrawlDbFilter will not be
>> easy, because none of the parameters in
>>
>> public void map(Text key, CrawlDatum value,
>> OutputCollector<Text, CrawlDatum> output,      Reporter reporter)
>>
>> stores the anchor text,
>>
>> is there any class through which i can access this anchor text?
>>
>> 2)in nutch 2.0 (nutch base) i think there is a way to get this anchor text
>> in
>>
>> class GeneratorMapper
>>
>> public void map(String reversedUrl, WebPage page,  Context context)
>>
>> through the WebPage class.
>>
>> But there is a problem: I think this WebPage object is for this url
>> (the reverse of reversedUrl), not its parent (the page containing this
>> outlink), and only the parent contains the anchor text.
>>
>> 3)what is the use of reprUrl member in WebPage class.
>>
>> Thanks
>> Nobin Mathew
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
