Hi, I wonder if anyone has looked into how you could parse Magnolia's rich text fields (nodeData, JCR properties) in order to strip the HTML out of them before their content is offered to Lucene (through Jackrabbit) for indexing?
Currently when you use Magnolia's (Jackrabbit's) Lucene search engine, all text content (JCR string type) is treated equally and is indexed as is. For rich text fields this is unfortunate, as it means that any HTML tags inside them are also indexed (in the full-text index). This means you can search on these tags if you want. E.g.: http://www.magnolia-cms.com/top-level/searchResult.html?queryStr=%3Ch2%3E or worse: http://www.magnolia-cms.com/top-level/searchResult.html?queryStr=strong

Even worse though, it means that you cannot use [url=http://wiki.apache.org/jackrabbit/ExcerptProvider]Jackrabbit's excerpt provider[/url], which is exposed when you use [url=http://www.openmindlab.com/lab/products/mgnlcriteria.html]Openmind's nice Criteria API[/url] for searching. N.b. Magnolia's own default excerpt implementation is very cumbersome and in fact does a new search, looking for occurrences of the search term in the JCR node tree.

As Lucene [url=http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_index_HTML_documents.3F]states[/url]: [quote]In order to index HTML documents you need to first parse them to extract text that you want to index from them[/quote]

Now, Jackrabbit does provide built-in support for parsing HTML documents (in the latest Jackrabbit by means of the Tika framework), but I think the problem is that this is done on the basis of the JCR mime type of the content and is only done for binary mime types (such as PDF or Docx). From a JCR point of view, the rich text fields are seen as text and not as binaries of type HTML.

Any ideas if there is a simple way to accomplish HTML parsing of rich text fields before indexing them in Magnolia's Lucene?

PS: I know you can write your own Lucene analyzer and configure it in the Magnolia Jackrabbit configuration, but as I understand it, parsing content is not what you should use analyzers for? Analyzers are meant for analyzing (stemming, tokenizing and such), not for parsing (I think).
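
To make the parsing step concrete, here is a minimal sketch of extracting plain text from a rich text value. It only covers the HTML-to-text part, not how to hook it into Jackrabbit's indexing, which is the open question above. The class and method names (RichTextStripper, stripHtml) are made up for illustration, and it assumes the HTML parser bundled with the JDK (javax.swing.text.html) is good enough for the markup the rich text editor produces; a library such as Tika could be swapped in instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Magnolia or Jackrabbit.
public class RichTextStripper {

    /** Returns the text content of an HTML fragment, without any tags. */
    public static String stripHtml(String html) {
        if (html == null || html.isEmpty()) {
            return "";
        }
        final StringBuilder text = new StringBuilder();
        // The callback only reacts to text events; tag events are ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so that words from adjacent
                // block elements (e.g. </h2><p>) do not run together.
                // Side effect: text split by inline tags also gets a space.
                text.append(' ').append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String.
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace left over from the markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The result of stripHtml(...) is what you would want to end up in the full-text index instead of the raw property value, for example by passing the value of a rich text property through it before it is indexed.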
