Re: Filtering HTML content in Solr 4.0.0

Rafał Kuć Fri, 26 Oct 2012 05:59:04 -0700

Hello!

You don't need a custom update request processor. There is a char
filter dedicated to stripping HTML tags from your content, so that only
the relevant text gets indexed:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
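
For illustration, a minimal schema.xml sketch of how the char filter could be wired in. The type name "text_html" and the field name "description" are assumptions made for this example, and the tokenizer and lowercase filter are just one common choice, so adjust them to your own schema:

```xml
<!-- Hypothetical field type: strips HTML markup before tokenizing -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so tags never become tokens -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field holding the product description -->
<field name="description" type="text_html" indexed="true" stored="true"/>
```

Note that the char filter only affects the indexed terms. The stored value of the field is returned as it was sent, markup included.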


However, you first need to send the content to Solr properly for indexing.
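
One possible way to do that, assuming you post documents in the XML update format: the HTML has to be wrapped in CDATA (or entity-escaped) so that its tags are not parsed as part of the update message itself. The field names and values below are made up for the example:

```xml
<add>
  <doc>
    <field name="id">1</field>
    <!-- CDATA keeps the embedded markup from being read as update XML -->
    <field name="description"><![CDATA[<p>Some <b>product</b> description</p>]]></field>
  </doc>
</add>
```

If you index through a client library such as SolrJ, this escaping is normally handled for you.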

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> I think you will have to write an UpdateProcessor to strip out html tags.

> http://wiki.apache.org/solr/UpdateRequestProcessor

> As of Solr 4.0 you can also write update processors in scripting
> languages such as Python, Ruby and JavaScript.

> -----Original Message-----
> From: Pratyul Kapoor
> Sent: Friday, October 26, 2012 3:56 AM
> To: solr-user@lucene.apache.org
> Subject: Filtering HTML content in Solr 4.0.0

> Hi,

> I am using Solr 4.0.0. I have HTML content as the description of a product.
> If I index it without any filtering, it gives errors on search.
> How can I filter HTML content?

> Pratyul
