Subcategorizing Page Content

Peyman Mohajerian Sun, 27 Nov 2011 09:12:50 -0800

Hi,

I have used the Nutch and Solr integration to crawl and index some content
successfully. However, I now need to categorize the content into a more
refined list. For example, imagine a page has sports and news sections
(under one URL) and I'd like to have each indexed separately in Solr.
Obviously I have to customize the HTMLParser and look for some CSS
tags to find the main labels and the items below those labels. Is there any
parser that reads CSS tags? I also need to modify schema.xml so that,
instead of just 'content', it has other attributes such as 'sport' and
'news'. Can these attributes form a hierarchy, e.g. under
'content', or do they have to be separate fields?
Other than changing the parser, what else do I have to worry
about? I'm thinking this is not a very uncommon use case, so there
may be more clues or examples somewhere. I hope I don't have to touch the
solrIndexer?
Another alternative, I think, is to have Solr store the full 'content'
and do all of the above within Solr. I don't have enough
experience to know which approach is better.
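
To make the idea concrete, here is a rough sketch of the kind of splitting I
have in mind. It is plain Python using only the standard library, not Nutch
or Solr code. The class names 'sports' and 'news', the field names, and the
helper split_sections are all made up for illustration, and it assumes each
section is wrapped in one element carrying that class attribute.

```python
from html.parser import HTMLParser

# Hypothetical mapping: CSS class on the wrapping element -> index field name.
SECTION_CLASSES = {"sports": "sport", "news": "news"}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SectionSplitter(HTMLParser):
    """Collects the text of each labelled section into its own field."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # field name -> list of text chunks
        self.current = None  # field being collected, or None outside a section
        self.depth = 0       # open-element depth inside the current section

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            self.depth += 1
            return
        # Outside any section: check whether this element opens one.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in SECTION_CLASSES:
                self.current = SECTION_CLASSES[name]
                self.depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None  # the wrapping element has closed

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.fields.setdefault(self.current, []).append(text)


def split_sections(html):
    """Return a dict of field name -> section text, one entry per section found."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {field: " ".join(chunks) for field, chunks in parser.fields.items()}


if __name__ == "__main__":
    page = ('<div class="sports"><h2>Sports</h2><p>Team wins</p></div>'
            '<div class="news"><h2>News</h2><p>Election today</p></div>')
    print(split_sections(page))
```

Each key of the returned dict would then correspond to one separate field in
schema.xml, with text outside the labelled sections left out.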


Thanks,
Peyman
