Hi,

Since I'm relatively new to Nutch/Solr, I was wondering if the following would make sense:

Headings in web pages (h1, h2, h3) should carry more weight than the rest of the page content, so a document should rank higher when a query matches one of its headings. To boost a field, I would need to index it separately. That means that when parsing the crawled pages, I would need to extract the h1, h2 and h3 headings, index them in separate fields, and remove them from the content field. I presume I would have to modify the HTML Parser and Index Basic plugins for this, or is there an easier solution?
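
To make the idea concrete, here is a rough sketch of the splitting step I have in mind. It is plain Python using only the standard library, not Nutch plugin code, and the names (HeadingSplitter, split_headings, and the field names h1, h2, h3 and content) are my own assumptions for illustration:

```python
from html.parser import HTMLParser

# Heading levels that would get their own index fields (assumed names).
HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingSplitter(HTMLParser):
    """Collects heading text separately from the remaining page text."""

    def __init__(self):
        super().__init__()
        self.fields = {"h1": [], "h2": [], "h3": [], "content": []}
        self._current = None  # heading tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text inside a heading goes to that heading's field only,
            # so it does not also end up in "content".
            self.fields[self._current or "content"].append(text)


def split_headings(html):
    """Return a dict with one string per field: h1, h2, h3, content."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


page = "<html><body><h1>Nutch Guide</h1><p>Intro text.</p><h2>Setup</h2><p>More text.</p></body></html>"
fields = split_headings(page)
```

Each key of the resulting dict would then map to its own field in the index, so that the heading fields could be boosted independently of content.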

Any input appreciated,
Elisabeth