[magnolia-user] Parse rich text (HTML) content before indexing in Lucene?

Edgar Vonk (via Magnolia Forums) Tue, 04 Dec 2012 07:02:45 -0800

Hi,

I wonder if anyone has looked into how you could parse Magnolia's rich text 
fields (nodeData, JCR properties) in order to strip out the HTML from them 
before offering it to Lucene (through Jackrabbit) for indexing?


Currently when you use Magnolia's (Jackrabbit's) Lucene search engine, all text 
content (JCR string type) is treated equally and is indexed as is. For rich 
text fields this is unfortunate as it means that any HTML tags inside them are 
also indexed (in the full text index). This means you can search on these tags 
if you want. E.g.:
http://www.magnolia-cms.com/top-level/searchResult.html?queryStr=%3Ch2%3E
or worse:
http://www.magnolia-cms.com/top-level/searchResult.html?queryStr=strong

Even worse though, it means that you cannot use 
[url=http://wiki.apache.org/jackrabbit/ExcerptProvider]Jackrabbit's excerpt 
provider[/url], which is exposed when you use 
[url=http://www.openmindlab.com/lab/products/mgnlcriteria.html]Openmind's nice 
Criteria API[/url] for searching. N.b. Magnolia's own default excerpt 
implementation is very cumbersome and in fact does a new search, looking for 
occurrences of the search term in the JCR node tree.

As the Lucene FAQ 
[url=http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_index_HTML_documents.3F]states[/url]:
[quote]In order to index HTML documents you need to first parse them to extract 
text that you want to index from them[/quote]

Now, Jackrabbit does provide built-in support for parsing HTML documents (in 
the latest Jackrabbit by means of the Tika framework) but I think the problem 
is that this is done on the basis of the JCR mime type of the content and is 
only done for binary mime types (such as PDF or DOCX). The rich text fields are seen 
as text and not as binaries of type HTML from a JCR point of view.

Any ideas if there is a simple way to accomplish HTML parsing of rich text 
fields before indexing them in Magnolia's Lucene?

PS: I know you can write your own Lucene analyzer and configure it in the 
Magnolia Jackrabbit configuration, but as I understand it, parsing content is 
not what analyzers should be used for? Analyzers are meant for analyzing 
(stemming, tokenizing and such), not for parsing (I think).
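
To illustrate the kind of pre-indexing step I have in mind, here is a minimal sketch. It is only an assumption of how the stripping itself could look: it uses the HTML parser that ships with the JDK (Swing) rather than Tika, the class and method names (RichTextStripper, toPlainText) are made up for this example, and it does not show where such a step would hook into Jackrabbit's indexing, which is exactly the part I am asking about.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Magnolia or Jackrabbit.
final class RichTextStripper {

    // Returns only the text content of an HTML fragment, with the tags removed.
    static String toPlainText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback only reacts to text chunks; start/end tags are ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // A space is appended after each chunk so that words from
                // adjacent elements do not run together. The downside is that
                // a word split by an inline tag ends up as two tokens.
                text.append(data).append(' ');
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String.
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace left over from the removed markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String richText = "<h2>Title</h2><p>Some <strong>bold</strong> text</p>";
        System.out.println(toPlainText(richText));
    }
}
```

With something like this applied to the rich text properties before they reach the full text index, a search for "strong" or "h2" would no longer match on markup, and the excerpts would presumably come out clean as well.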

-- 
Context is everything: 
http://forum.magnolia-cms.com/forum/thread.html?threadId=dd9d9376-c495-41f9-ad88-9c6d0a21a2cf


----------------------------------------------------------------
For list details, see http://www.magnolia-cms.com/community/mailing-lists.html
Alternatively, use our forums: http://forum.magnolia-cms.com/
To unsubscribe, E-mail to: <[email protected]>
----------------------------------------------------------------
