Hi,

On Fri, Dec 24, 2010 at 6:39 PM, Markus Jelsma <[email protected]> wrote:
> You can find the anchor text in the LinkDB.

Thanks, but how would we access the LinkDb from CrawlDbFilter? I want the filter in CrawlDbFilter because only there can I reduce the amount of fetching required. I would fetch only the pages whose anchor text matches a specific pattern; if the anchor text does not match that pattern, I will not fetch the page (just like a urlfilter, but using the anchor instead of the URL).

Correct me if I am wrong; I don't know whether I am missing something basic in the Nutch architecture.

>
> On Friday 24 December 2010 14:00:45 Nobin Mathew wrote:
>> Hi,
>>
>> I am Nobin, and I am working on a search engine based on nutch.
>>
>> I have some questions regarding nutch, and it will be very helpful for me
>> if somebody can answer.
>>
>> I am working on a plugin (anchor based url filter) where i need to have
>> anchor text in CrawlDbFilter (nutch 1.2), but after going through
>> source, it seems getting anchor in CrawlDbFilter will not be easy,
>> because none of the parameters in
>>
>> public void map(Text key, CrawlDatum value,
>> OutputCollector<Text, CrawlDatum> output, Reporter reporter)
>>
>> stores the anchor text,
>>
>> is there any class through which i can access this anchor text?
>>
>> 2)in nutch 2.0 (nutch base) i think there is a way to get this anchor text
>> in
>>
>> class GeneratorMapper
>>
>> public void map(String reversedUrl, WebPage page, Context context)
>>
>> through the WebPage class.
>>
>> But there is a problem, I think this WebPage object is for this url
>> (reverse of reversedUrl), not its parent (the parent's webpage, i.e. the
>> page containing this outlink); only the parent contains the anchor text.
>>
>> 3)what is the use of reprUrl member in WebPage class.
>>
>> Thanks
>> Nobin Mathew
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
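
P.S. To make the idea concrete, here is a rough sketch in plain Java of the check I have in mind. AnchorTextFilter and its filter method are hypothetical names, not Nutch classes, and the list of anchors is assumed to have been read from the LinkDb beforehand; how to get that list into CrawlDbFilter is exactly the open question. It only mirrors the urlfilter convention of returning null to reject a URL.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides whether a URL should be
// kept for fetching based on the anchor texts of the links pointing to it.
final class AnchorTextFilter {
    private final Pattern pattern;

    // regex is the anchor-text pattern; matching is case-insensitive.
    AnchorTextFilter(String regex) {
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Returns the URL unchanged if at least one anchor contains a match,
    // or null to drop it. A null or empty anchor list rejects the URL.
    String filter(String url, List<String> anchors) {
        if (anchors == null) {
            return null;
        }
        for (String anchor : anchors) {
            if (anchor != null && pattern.matcher(anchor).find()) {
                return url;
            }
        }
        return null;
    }
}
```

With a filter built from the pattern "search engine", a URL whose inlink anchors include "Best Search Engine tips" would be kept, while a URL with only unrelated anchors, or with no known anchors at all, would be dropped before fetching.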

