Hi Peyman,

There are a couple of questions here, some of which I must admit are purely Solr related.

1) You seem to have a pretty good idea of what needs to be customised and where. With regards to a CSS parser, I would assume that Tika would handle this for you; I would be extremely surprised if it didn't. I had a quick look through the Tika archives for the keyword CSS [1] and there is plenty there, so hopefully you can implement something from the libraries.

2) With regards to the various fields, have a look at the new Solr 4.x schema support Andrzej added; this will give you a flavour of more complex/expressive configurations. As for nesting fields within some sort of hierarchy, I am not entirely sure, maybe someone can advise. However, even if this is not possible, you can still create individual fields as we do for numerous other elements.

3) I would imagine that an IndexingFilter to handle all of this should leave you free from having to hack the SolrIndexer.

[1] http://tika.markmail.org/search/?q=css

On Sun, Nov 27, 2011 at 5:12 PM, Peyman Mohajerian <[email protected]> wrote:
> Hi,
>
> I have used Nutch and Solr integration to crawl/index some content
> successfully. However now I need to categorize the content into more
> refined list, e.g. imagine the page has sports and news sections (in
> one url) and I'd like to have each separately indexed in solr.
> Obviously I have to customize the HTMLParser and look for some css
> tags to see the main labels and items below those labels, is there any
> parser that reads css tags? Also I need to modify schema.xml to have
> other attributes instead of just 'content' it would have 'sport',
> 'news' and etc. Can these attributes have hierarchy e.g. under
> 'content' or they have to be separate fields?
> Other than changing the parser what other things do I have to worry
> about? I'm thinking this is not a very uncommon use case and there
> maybe more clues or example? I hope I don't have to touch the
> solrIndexer?
> Another alternative, I think, is to have solr store the full 'content'
> and do all the above things within solr, I don't have enough
> experience to know which approach is better?
>
> Thanks,
> Peyman
>

--
*Lewis*
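
P.S. To make the idea in points 1 and 3 a bit more concrete, here is a minimal Python sketch of the splitting step on its own: walk the HTML, and collect the text under each container whose CSS class marks a section into its own field. It deliberately does not use any Nutch, Tika or Solr API. The class names (sports, news), the field names (sport, news), and the helper split_sections are all assumptions made up for illustration; in Nutch the equivalent logic would live in a parse/indexing filter plugin.

```python
from html.parser import HTMLParser

# Hypothetical mapping: CSS class on a container element -> index field name.
SECTION_FIELDS = {"sports": "sport", "news": "news"}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SectionSplitter(HTMLParser):
    """Collects the text inside each labelled container into its own field."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # field name -> list of text chunks
        self.current = None  # field being collected, or None outside a section
        self.depth = 0       # open-element depth inside the current section

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            # Already inside a section: just track nesting.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in SECTION_FIELDS:
                self.current = SECTION_FIELDS[cls]
                self.depth = 1
                self.fields.setdefault(self.current, [])
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # The container that opened the section has closed.
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.fields[self.current].append(text)


def split_sections(html):
    """Return a dict of field name -> text, one entry per section found."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.fields.items()}
```

Each key of the returned dict would then correspond to one separate field declared in schema.xml, alongside the existing 'content' field, which is the flat (non-hierarchical) layout mentioned in point 2. Note the sketch assumes reasonably well-formed HTML; real pages with unclosed tags would need a more forgiving parser.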

