amount of
string processing it does, the fact that it is a Reader probably does not
affect its performance.
Cheers,
Lance
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML
John,
Solr already has some of this stuff:
$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStr
Actually, it's very easy: http://us2.php.net/strip_tags
I also store the data in a separate field with the html intact for
display. In that case, I use urlencode on the string.
David
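For anyone on Java rather than PHP, the same two-field idea could be sketched roughly as below. This is only an illustration: the class and method names (HtmlFields, stripTags, encodeForStorage) are made up for this example, and the regex is a naive stand-in for strip_tags that will not cope with comments, scripts or malformed markup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: one value for the indexed (searchable) field,
// one for the stored field that keeps the HTML intact for display.
final class HtmlFields {
    private HtmlFields() {}

    // Naive tag removal: replaces anything between '<' and '>' with a space,
    // then collapses runs of whitespace. Not a real HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // URL-encodes the raw HTML so it can travel as a plain string value.
    static String encodeForStorage(String html) {
        try {
            return URLEncoder.encode(html, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(stripTags(html));
        System.out.println(encodeForStorage(html));
    }
}
```

The display side then has to URL-decode the stored value (java.net.URLDecoder) before rendering it.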
McBride, John wrote:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Up
Hi,
Maybe this one?
http://htmlparser.sourceforge.net/
/Jimi
Quoting "McBride, John" <[EMAIL PROTECTED]>:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
Thanks Jérôme!
It seems to work now. I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.
I use Java, and the HtmlEncoder class documented at
http://itext.ugent.be/library/api/ worked for the encoding. (Just
in case someone happens to search this thread.)
Ravi
You need to encode your HTML content so it can be included as a normal
'string' value in your XML element.
As far as I remember, the only unsafe characters you have to encode as
entities are:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
(google xml entities to be sure).
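A minimal Java sketch of that mapping, in case it helps. The class name XmlEscaper and the field name body_raw are invented for the example, and it handles only the four characters listed above; an existing library encoder is probably the safer choice in real code.

```java
// Hypothetical helper that applies the four replacements listed above.
final class XmlEscaper {
    private XmlEscaper() {}

    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p class=\"intro\">Fish & chips</p>";
        // body_raw is an assumed field name, not one from this thread.
        System.out.println("<field name=\"body_raw\">" + escape(html) + "</field>");
    }
}
```

Working character by character in a single pass means the ampersands introduced by the other replacements are never escaped a second time.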
I don't know what language you use, but fo