Separately indexing headings of the content

Elisabeth Adler Mon, 12 Sep 2011 01:59:16 -0700

Hi,

Since I'm relatively new to Nutch/Solr, I was wondering if the following would make sense:

Headings in web pages (h1, h2, h3) should be more important than any other content of the page, so if a match to a query turns up in a heading, the document should rank higher. In order to boost a field, I would need to index it separately. This means that when parsing the crawled pages, I would need to strip out the h1, h2 and h3 headings, index them in separate fields, and remove them from the content field. I presume I would have to modify the HTML Parser and Index Basic plugins for this, or is there an easier solution?
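
To make the splitting step concrete, here is a minimal standalone sketch in Python. It is not Nutch plugin code and uses no Nutch or Solr API; the names `HeadingSplitter` and `split_headings` and the output keys `h1`, `h2`, `h3` and `content` are assumptions chosen for illustration. It only shows the idea of routing heading text into separate fields while keeping it out of the main content field.

```python
from html.parser import HTMLParser

# Heading levels to pull out into their own fields (assumption: h1-h3 only).
HEADING_TAGS = ("h1", "h2", "h3")


class HeadingSplitter(HTMLParser):
    """Collects h1-h3 text per tag and all remaining text as body content."""

    def __init__(self):
        super().__init__()
        self.headings = {tag: [] for tag in HEADING_TAGS}
        self.body = []
        self._current = None  # heading tag we are currently inside, if any
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # Start buffering when a heading opens (nested headings are ignored).
        if tag in HEADING_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        # On the matching close tag, store the whitespace-normalized heading.
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings[tag].append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            # Inside a heading: keep the text out of the body.
            self._buffer.append(data)
        else:
            text = " ".join(data.split())
            if text:
                self.body.append(text)


def split_headings(html):
    """Return a dict with one list per heading level plus the body text."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    fields = {tag: parser.headings[tag] for tag in HEADING_TAGS}
    fields["content"] = " ".join(parser.body)
    return fields


if __name__ == "__main__":
    page = ("<html><body><h1>Nutch <em>Guide</em></h1><p>Intro text.</p>"
            "<h2>Setup</h2><p>More text.</p></body></html>")
    print(split_headings(page))
```

The sketch does not skip script, style or title text, which a real parser would normally handle. In an actual setup the separate fields would still need to be declared in the index schema and given a boost at query time.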


Any input appreciated,
Elisabeth
