Thanks! Somehow I missed that JIRA issue.

On 12.09.2011 11:20, Markus Jelsma wrote:
https://issues.apache.org/jira/browse/NUTCH-1005

Hi,

Since I'm relatively new to Nutch/Solr, I was wondering if the following
would make sense:

Headings in web pages (h1, h2, h3) should be more important than any
other content of the page, so if a match to a query turns up in a
heading, the ranking of the document should be higher. In order to boost
a field, I would need to index it separately. This would mean that, when
parsing the crawled pages, I would need to extract the h1, h2 and h3
headings, index them in separate fields, and remove them from the content
field. I presume I would have to modify the HTML Parser and Index Basic
plugins for this, or is there an easier solution?
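
To make the idea concrete, here is a minimal sketch of the splitting step in
plain Python. It is not Nutch code and does not use the Nutch plugin API; the
class name HeadingSplitter, the function split_headings and the field names
h1, h2, h3 and content are made up for illustration. It only shows how heading
text could be separated from the rest of the page text.

```python
from html.parser import HTMLParser

# Heading levels to pull out into their own fields (assumption: h1-h3 only).
HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingSplitter(HTMLParser):
    """Collects heading text and remaining page text into separate fields.

    Hypothetical helper for illustration; not part of Nutch or Solr.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"h1": [], "h2": [], "h3": [], "content": []}
        self._current = None  # heading tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self.fields[tag].append("")  # start a new heading entry

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current:
            # Text inside a heading (including nested inline tags such as <em>)
            # is appended to the heading entry, not to the content field.
            entries = self.fields[self._current]
            entries[-1] = (entries[-1] + " " + text).strip()
        else:
            self.fields["content"].append(text)


def split_headings(html):
    """Return a dict mapping field name to a list of text fragments."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    # Drop heading entries that turned out to be empty (e.g. <h2></h2>).
    return {name: [t for t in parts if t] for name, parts in parser.fields.items()}
```

With the headings in their own fields, the weighting could then happen at
query time, for example with the Solr DisMax qf parameter
(qf=h1^5 h2^3 content, where the field names and boost values are again just
assumptions).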

Any input appreciated,
Elisabeth
