Re: How to index X™ as &#8482; (HTML decimal entity)

Jack Krupansky Thu, 21 Nov 2013 10:32:28 -0800

Ah... now I understand your perspective - you have taken a narrow view of what "text" is. A broader view is that it can contain formatting and special "entities" as well, or rich text in general. My "read" is that it all depends on the nature of the application and its requirements, not a "one size fits all" approach. The four main approaches are pure ASCII, Unicode/UTF-8, SGML for non-ASCII characters, and full HTML for formatting and rich text. Let the app's needs determine which is most appropriate for each piece of text.

The goal of SGML and HTML is not to hard-wire the final presentation, but simply to preserve some level of source format and structure, and then apply final presentation formatting on top of that.

Some apps may opt to store the same information in multiple formats, such as one for raw text search, one for basic display, and one for "detail" display.

I'm more of a "platform" guy than an "app-specific" guy - give the app developer tools that they can blend to meet their own requirements (or interests or tastes).

But Solr users should make no mistake, SGML entities are a perfectly valid intermediate format for rich text.


-- Jack Krupansky

-----Original Message----- From: Walter Underwood

Sent: Thursday, November 21, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

And this is the exact problem. Some characters are stored as entities, some are not. When it is time to display, what else needs escaped? At a minimum, you would have to always store & as &amp; to avoid escaping the leading ampersand in the entities.

You could store every single character as a numeric entity. Or you could store every non-ASCII character as a numeric entity. Or every non-Latin1 character. Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &trade;. How would you do that? By storing "&trade;" as "&amp;trade;".
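A minimal Python sketch of the alternative to that double-think, assuming the stored value is plain Unicode and escaping happens only when HTML is produced. The helper name to_html is made up for illustration; html.escape and html.unescape are from the Python standard library.

```python
import html

def to_html(stored_text):
    """Escape stored Unicode text for an HTML page (the only place escaping happens)."""
    return html.escape(stored_text, quote=False)

symbol = "\u2122"      # the trademark sign itself
literal = "&trade;"    # seven literal characters that someone typed

print(to_html(symbol))   # the sign passes through unchanged
print(to_html(literal))  # &amp;trade;

# Both values survive the round trip, so they stay distinguishable.
for text in (symbol, literal):
    assert html.unescape(to_html(text)) == text
```

Because the ampersand is escaped only at output time, nothing in storage has to record which characters were already entities.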

To avoid all this double-think, always store text as Unicode code points, encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented in the target character encoding. If you are sending things in US-ASCII, you will be sending lots of entities.

A good encoding library has callbacks for characters that cannot be represented. You can use these callbacks to format out-of-charset codepoints as entities. I've done this in product code, it really works.
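As one illustration of such a callback (a sketch, not the product code mentioned above), Python's codecs module lets you register an error handler that is called for each run of unencodable characters. The handler name "decimal_entity" and the function names here are assumptions for the example; Python also ships a built-in handler, "xmlcharrefreplace", that does the same job.

```python
import codecs

def entity_fallback(error):
    """Replace each character the target charset cannot represent with &#NNNN;."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(f"&#{ord(ch)};" for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end

# "decimal_entity" is a name chosen for this example.
codecs.register_error("decimal_entity", entity_fallback)

def encode_for_output(text, charset):
    """Encode stored Unicode text, making entities only where the charset needs them."""
    return text.encode(charset, errors="decimal_entity")

stored = "X\u2122 - Black"
ascii_bytes = encode_for_output(stored, "ascii")    # b'X&#8482; - Black'
utf8_bytes = encode_for_output(stored, "utf-8")     # no entity needed
builtin = stored.encode("ascii", errors="xmlcharrefreplace")
```

The stored value never changes; only the bytes sent to a US-ASCII client contain the entity. This step handles the charset only, so markup characters such as the ampersand still need ordinary HTML escaping first.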

Finally, if you don't believe me, believe the XML Infoset, where numeric entities are always interpreted as Unicode codepoints.

The other way to go insane is storing local time in the database. Always store UTC and convert at the edges.
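The same "convert at the edges" idea, sketched in Python with the standard datetime module. The function names are made up for the example, and the fixed -0800 offset is an assumption chosen to match the timestamp on this thread; a real application would use proper time zone data.

```python
from datetime import datetime, timedelta, timezone

def to_storage(local_dt):
    """Convert an offset-aware datetime to UTC before it is stored."""
    if local_dt.tzinfo is None:
        raise ValueError("refusing to store a datetime without an offset")
    return local_dt.astimezone(timezone.utc)

def to_display(stored_utc, display_tz):
    """Convert a stored UTC datetime to the viewer's zone at display time."""
    return stored_utc.astimezone(display_tz)

# Fixed offset used only for illustration.
pacific = timezone(timedelta(hours=-8))

sent = datetime(2013, 11, 21, 10, 32, 28, tzinfo=pacific)
stored = to_storage(sent)              # 2013-11-21 18:32:28+00:00
shown = to_display(stored, pacific)    # back to 10:32:28-08:00
```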


wunder

On Nov 21, 2013, at 7:50 AM, "Jack Krupansky" <j...@basetechnology.com> wrote:

"Would you store "A" as "&#65;" ?"

No, not in any case.

-- Jack Krupansky

-----Original Message----- From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use Unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "A" as "&#65;" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?
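
A small Python sketch of "decode on the way in", assuming the client normalizes values before they are sent for indexing. The function name normalize_on_ingest is hypothetical; html.unescape is the standard-library call that decodes both named and numeric character references.

```python
import html

def normalize_on_ingest(raw):
    """Decode named and numeric character references to plain Unicode text."""
    return html.unescape(raw)

# Three spellings of the same display name, as they might arrive from feeds.
incoming = ["X&#8482; - Black", "X&trade; - Black", "X\u2122 - Black"]
stored = [normalize_on_ingest(value) for value in incoming]

# After normalization, a search for the TM character finds all of them.
matches = [value for value in stored if "\u2122" in value]
```

With mixed entity spellings left in storage, the same search would have found only the third value.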

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself.
But maybe TM should be encoded as "&trade;". Ditto for other named SGML entities.
-- Jack Krupansky

-----Original Message----- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea.
wunder
On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" <j...@basetechnology.com> wrote:
Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored.
-- Jack Krupansky

-----Original Message----- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching, e.g.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™"
replacement="&amp;#8482;" />
or
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-specials.txt" />

Uwe

On 19.11.2013 23:46, Developer wrote:
I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>
I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
in SOLR rather than storing the original value.

Is there a way to do this?
--
Walter Underwood
wun...@wunderwood.org


