Hi,

I have successfully used the Nutch and Solr integration to crawl and index some content. Now I need to categorize the content into a more refined list. For example, imagine a page that has sports and news sections under one URL; I'd like each section to be indexed separately in Solr.

Presumably I have to customize the HTMLParser and look for certain CSS classes or selectors to find the main labels and the items below those labels. Is there a parser that can read CSS selectors?

I also need to modify schema.xml so that, instead of just 'content', it has other attributes such as 'sport', 'news', etc. Can these attributes form a hierarchy, e.g. under 'content', or do they have to be separate fields?

Other than changing the parser, what else do I have to worry about? I assume this is not a very uncommon use case, so there may be more clues or examples somewhere. I hope I don't have to touch the solrIndexer?

Another alternative, I think, is to have Solr store the full 'content' and do all of the above within Solr. I don't have enough experience to know which approach is better.
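
To make the question concrete, here is a rough sketch of the kind of splitting I have in mind. It is plain Python using only the standard library, purely for illustration, and is not Nutch or Solr code. The class names "sports" and "news", the field names, and the split_sections helper are all made up for this example, and it assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical mapping: CSS class on a container -> field it should feed.
SECTION_CLASSES = {"sports": "sport", "news": "news"}

# Elements that never have a closing tag, so they must not be tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SectionSplitter(HTMLParser):
    """Collects text per labelled section, keyed by target field name."""

    def __init__(self):
        super().__init__()
        self.fields = {}  # field name -> list of text chunks
        self.stack = []   # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((SECTION_CLASSES[c] for c in classes
                      if c in SECTION_CLASSES), None)
        # Nested elements inherit the section of their enclosing element.
        if field is None and self.stack:
            field = self.stack[-1]
        self.stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text outside any labelled section is ignored in this sketch.
        if text and self.stack and self.stack[-1]:
            self.fields.setdefault(self.stack[-1], []).append(text)


def split_sections(html):
    """Return {field name: joined text} for each labelled section found."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.fields.items()}
```

Each key of the returned dict would then become either its own field on the page's document or its own document, which is exactly the part I am unsure about.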
Thanks, Peyman

