On Sep 1, 2014, at 3:46 AM, xan <[email protected]> wrote:

> As a start, I'm able to crawl websites and index the entire content to Solr.
> 
> But, I want to index only specific content between certain HTML tags instead
> of the whole page.
> 
> So, to achieve this, what should I use and how? A parser or a filter?

My answer would be both: you’ll need to write a parser plugin to extract the
desired sections of the raw web content, and then an indexer plugin to
actually index that data into Solr. I recommend taking a closer look at how
the default plugins are written.
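
To give a rough idea of the extraction step, here is a minimal sketch in plain
Java that needs nothing beyond the JDK. As far as I know, the HtmlParseFilter
extension point in Nutch 1.x hands your plugin the parsed page as a DOM
fragment, so the core of the work is a DOM walk like the one below. The class
name TagTextExtractor and the extract() helper are hypothetical names used
only for illustration, and the helper assumes well-formed XHTML input so it
can be tried outside Nutch. Wiring it into a real plugin (plugin.xml,
nutch-site.xml, storing the text so the indexer can pick it up) is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the text found inside
// every element with a given tag name.
final class TagTextExtractor {

    // Walks the DOM depth-first and appends the text of each matching element.
    static void collect(Node node, String tag, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tag)) {
            out.append(node.getTextContent().trim()).append('\n');
            return; // nested matches are already part of this element's text
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, tag, out);
        }
    }

    // Convenience wrapper for experimenting outside Nutch.
    // Assumption: the input is well-formed XHTML; real pages usually are not.
    static String extract(String xhtml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collect(doc, tag, out);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div>menu</div>"
                + "<article>Hello <b>world</b></article></body></html>";
        System.out.println(extract(page, "article")); // prints "Hello world"
    }
}
```

Inside an actual parse filter you would call something like collect() on the
DOM node Nutch passes in, keep the result with the parse data, and let your
indexing plugin copy it into a Solr field.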

> 
> I browsed through the mailing archives and a lot of blogs but couldn't find
> any suitable methods of doing so.
> 

I’ve written a small post on how to write an indexing plugin [0]. It’s not
advanced at all, but it will hopefully give you an idea. It targets Nutch
1.x, so the explanation is no longer valid for 2.x, although the general idea
could still apply.

> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/HTML-tag-filtering-or-parsing-tp4156126.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

[0] https://jorgelbg.wordpress.com/2014/08/30/indexing-inlinks-and-outlinks-with-nutch-1-x/
