Re: Subcategorizing Page Content

Lewis John Mcgibbney Sun, 27 Nov 2011 10:19:39 -0800

Hi Peyman,

There are a couple of questions here, some of which I must admit are
purely Solr related.

1) You seem to have a pretty good idea of what needs to be customised and
where. With regards to a CSS parser, I would assume that Tika would handle
this for you; I would be extremely surprised if it didn't. Having had a
quick look through the Tika archives for the keyword CSS [1], there is
plenty there, so hopefully you can implement something from the libraries.
2) With regards to the various fields, if you have a look at the new Solr
4.x schema support Andrzej added, this will give you a flavour for more
complex/expressive configurations. As for nesting fields within some sort
of hierarchy, I am not entirely sure; maybe someone can advise. However,
even if this is not possible, you can still create individual fields, as
we do for numerous other elements.
3) I would imagine that an IndexingFilter to handle all of this stuff
should leave you free from having to hack the SolrIndexer.
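
To make the points above a bit more concrete, here is a minimal sketch of
the splitting idea. It is plain Python using only the standard library
html.parser, not Nutch or Tika code, and the class names "sports" and
"news", the field names, and the helper split_sections are all made up for
illustration. In Nutch the same logic would presumably live in a parse or
indexing filter plugin, with each value added as its own document field.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class name to index field name.
SECTION_CLASSES = {"sports": "sport", "news": "news"}

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SectionExtractor(HTMLParser):
    """Collects the text under each labelled section into its own field."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in SECTION_CLASSES.values()}
        self.current = None  # field currently being filled, if any
        self.depth = 0       # nesting depth inside the current section

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            # Already inside a section: just track nesting.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for css_class in classes:
            if css_class in SECTION_CLASSES:
                self.current = SECTION_CLASSES[css_class]
                self.depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # The element that opened the section has closed.
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.fields[self.current].append(text)


def split_sections(html):
    """Return one text value per field, e.g. {'sport': ..., 'news': ...}."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


if __name__ == "__main__":
    page = ('<div class="sports"><h2>Sports</h2><p>Match report</p></div>'
            '<div class="news"><h2>News</h2><p>Headlines</p></div>')
    print(split_sections(page))
```

Each key in the returned dictionary would correspond to a separate field
declared in schema.xml, which sidesteps the question of nesting fields.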

[1] http://tika.markmail.org/search/?q=css

On Sun, Nov 27, 2011 at 5:12 PM, Peyman Mohajerian <[email protected]> wrote:

> Hi,
>
> I have used Nutch and Solr integration to crawl/index some content
> successfully. However now I need to categorize the content into more
> refined list, e.g. imagine the page has sports and news sections (in
> one url) and I'd like to have each separately indexed in solr.
> Obviously I have to customize the HTMLParser and look for some css
> tags to see the main labels and items below those labels, is there any
> parser that reads css tags? Also I need to modify schema.xml to have
> other attributes instead of just 'content' it would have 'sport',
> 'news' and etc. Can these attributes have hierarchy e.g. under
> 'content' or they have to be separate fields?
> Other than changing the parser what other things do I have to worry
> about? I'm thinking this is not a very uncommon use case and there
> maybe more clues or example? I hope I don't have to touch the
> solrIndexer?
> Another alternative, I think, is to have solr store the full 'content'
> and do all the above things within solr, I don't have enough
> experience to know which approach is better?
>
> Thanks,
> Peyman
>

-- 
*Lewis*
