Re: Filtering HTML content in Solr 4.0.0

2012-10-26 Thread Rafał Kuć
Hello!

You are trying to put the HTML into the XML sent to Solr, right? You
need to escape the markup properly to do that. For example, look at the
utf8-example.xml file in the exampledocs directory that comes with
Solr and you'll see something like this:

<field name="features">tag with escaped chars: &lt;nicetag/&gt;</field>

As you can see, the < and > are properly encoded as &lt; and &gt;
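
As a fuller illustration, a complete update message carrying an HTML
description might look like the sketch below. The field names (id,
description) and their values are assumptions made up for this example,
not taken from the original poster's schema. The second document shows a
CDATA section, which is a standard XML alternative to escaping each
character.

```xml
<add>
  <doc>
    <field name="id">product-1</field>
    <!-- markup escaped with XML entities -->
    <field name="description">&lt;p&gt;A &lt;b&gt;sturdy&lt;/b&gt; desk lamp&lt;/p&gt;</field>
  </doc>
  <doc>
    <field name="id">product-2</field>
    <!-- same content wrapped in a CDATA section instead -->
    <field name="description"><![CDATA[<p>A <b>sturdy</b> desk lamp</p>]]></field>
  </doc>
</add>
```

Either form lets the XML parser accept the document; Solr then receives
the original HTML string as the field value.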

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 Hi,

 I am using Solr 4.0.0. I have HTML content as the description of a product.
 If I index it without any filtering, I get errors on search.
 How can I filter HTML content?

 Pratyul



Re: Filtering HTML content in Solr 4.0.0

2012-10-26 Thread Rogério Pereira Araújo

I think you will have to write an UpdateProcessor to strip out HTML tags.

http://wiki.apache.org/solr/UpdateRequestProcessor

As of Solr 4.0 you can also use scripting languages like Python, Ruby and
JavaScript to write scripts for use as update processors.
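
For reference, an update processor chain is declared in solrconfig.xml.
The sketch below assumes that solr.HTMLStripFieldUpdateProcessorFactory
is available in your Solr version (please check the Javadocs for your
release) and that the field is called description; the chain name
strip-html is also made up for this example.

```xml
<!-- solrconfig.xml: hypothetical chain name and field name -->
<updateRequestProcessorChain name="strip-html">
  <!-- assumed to exist in your release: removes HTML markup from the field value -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">description</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- must come last: this is the processor that actually indexes the document -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain like this would be selected per request with the update.chain
parameter. Unlike an analysis-time filter, it changes the value before
it is stored as well as indexed.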


-----Original Message-----
From: Pratyul Kapoor

Sent: Friday, October 26, 2012 3:56 AM
To: solr-user@lucene.apache.org
Subject: Filtering HTML content in Solr 4.0.0

Hi,

I am using Solr 4.0.0. I have HTML content as the description of a product.
If I index it without any filtering, I get errors on search.
How can I filter HTML content?

Pratyul 



Re: Filtering HTML content in Solr 4.0.0

2012-10-26 Thread Rafał Kuć
Hello!

You don't need a custom update request processor. There is a char
filter dedicated to stripping HTML tags from your content, so that only
the relevant parts of it are indexed:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

However, you first need to properly send it to Solr for indexing. 
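
A minimal sketch of how the char filter could be wired into schema.xml
follows. The type name text_html, the field name description, and the
choice of tokenizer and lowercase filter are assumptions for
illustration; adapt them to your own schema.

```xml
<!-- schema.xml: hypothetical field type and field names -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strips HTML markup before the tokenizer sees the text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="description" type="text_html" indexed="true" stored="true"/>
```

Keep in mind that a char filter only affects the indexed terms. The
stored value returned in search results still contains the markup
exactly as it was sent.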

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
