On Sep 1, 2014, at 3:46 AM, xan <[email protected]> wrote:

> As a start, I'm able to crawl websites and index the entire content to Solr.
>
> But, I want to index only specific content between certain HTML tags instead
> of the whole page.
>
> So, to achieve this, what should I use and how? Parser or filter.

My answer would be both: you'll need to write a parser plugin to extract the desired sections of the raw web content, and then an indexer plugin to actually index that data into Solr. I recommend taking a closer look at how the default plugins are written.

> I browsed through the mailing archives and a lot of blogs but couldn't find
> any suitable methods of doing so.

I've written a small post on how to write an indexing plugin [0]. It's not advanced at all, but hopefully it will give you an idea. It targets Nutch 1.x, so the explanation is not valid as written for 2.x, although the general idea could still apply.

> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/HTML-tag-filtering-or-parsing-tp4156126.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

[0] https://jorgelbg.wordpress.com/2014/08/30/indexing-inlinks-and-outlinks-with-nutch-1-x/
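
As a rough illustration of the extraction step a parser plugin would perform, here is a minimal sketch in plain Java. It is not Nutch code: the class TagTextExtractor, its methods, and the sample page are hypothetical, and it uses only the JDK's DOM parser, which assumes well-formed XHTML input. As far as I know, in Nutch 1.x a similar DOM walk would live in a parse filter plugin that is handed the already-parsed document, and an indexer plugin would then add the extracted text as a field; check the default plugins for the actual interfaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: pulls text out of selected tags.
public class TagTextExtractor {

    // Parses a well-formed XHTML string into a DOM document.
    // Assumption: real crawled HTML is rarely well-formed, so a lenient
    // HTML parser would be needed in practice.
    public static Document parse(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    // Collects the trimmed text of every <tag> element whose class attribute
    // equals cssClass; a null cssClass matches every element with that tag.
    public static List<String> extract(Document doc, String tag, String cssClass) {
        List<String> out = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            if (cssClass == null || cssClass.equals(el.getAttribute("class"))) {
                // getTextContent() also includes text of nested elements.
                out.add(el.getTextContent().trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<div class=\"nav\">Home</div>"
            + "<div class=\"article\">Text to <b>index</b></div>"
            + "</body></html>";
        System.out.println(extract(parse(page), "div", "article"));
    }
}
```

Running main prints [Text to index]; the nav block is skipped because its class attribute does not match. The resulting strings are what an indexer plugin would then store in a dedicated Solr field instead of the full page content.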

