And this is the exact problem. Some characters are stored as entities, some are 
not. When it is time to display, what else needs escaped? At a minimum, you 
would have to always store & as &amp; to avoid escaping the leading ampersand 
in the entities.

You could store every single character as a numeric entity. Or you could store 
every non-ASCII character as a numeric entity. Or every non-Latin1 character. 
Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &#8482;. How would you do 
that? By storing "&#8482;" as "&amp;#8482;".
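
To make the ambiguity concrete, here is a minimal Python sketch. It is not from the thread: the sample string and variable names are made up for illustration, and it only uses the standard library html module. It compares storing plain Unicode and escaping at render time with storing a mix of entities and raw text.

```python
import html

# Text as an author might type it: one real trademark sign, plus a literal
# mention of the character reference itself (as happens in this thread).
original = "X\u2122 is written &#8482; in HTML"

# Scheme A: store Unicode, escape markup characters only when rendering.
rendered = html.escape(original, quote=False)
round_trip = html.unescape(rendered)

# Scheme B: store the trademark sign as an entity, but leave the
# ampersands that were already in the text unescaped.
stored_mixed = original.replace("\u2122", "&#8482;")
decoded_mixed = html.unescape(stored_mixed)

print(rendered)                   # the literal mention is now &amp;#8482;
print(round_trip == original)     # scheme A is lossless
print(decoded_mixed == original)  # scheme B cannot tell the two apart
```

With scheme B, the character and the literal reference collapse into the same stored bytes, so no decoder can recover which one the author meant.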

To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented in 
the target character encoding. If you are sending things in US-ASCII, you will 
be sending lots of entities.

A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints as 
entities. I've done this in product code, and it really works.
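
As one illustration of such callbacks (a sketch, not the code referred to above), Python's codecs machinery has a built-in "xmlcharrefreplace" error handler and lets you register your own. The handler name "hexcharref", the function name, and the sample strings below are assumptions made up for this example.

```python
import codecs
import html

text = "X\u2122 - Black"

# Built-in callback: characters the target charset cannot represent
# are written as decimal character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# With Latin-1 as the target, e-acute is representable and stays a raw
# byte; only the trademark sign becomes a reference.
latin1 = "caf\u00e9 \u2122".encode("latin-1", errors="xmlcharrefreplace")


def hex_charref(err):
    """Custom callback (hypothetical name): emit hexadecimal references."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"&#x{ord(ch):X};" for ch in bad), err.end


codecs.register_error("hexcharref", hex_charref)
hex_bytes = text.encode("ascii", errors="hexcharref")

# Escape markup characters first, then encode, so that a literal
# ampersand in the data is never mistaken for the start of a reference.
safe = html.escape("AT&T\u2122", quote=False).encode("ascii", "xmlcharrefreplace")

print(ascii_bytes, latin1, hex_bytes, safe)
```

The stored text stays plain Unicode throughout; references only appear in the bytes produced for a particular output encoding.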

Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always interpreted as Unicode codepoints.

The other way to go insane is storing local time in the database. Always store 
UTC and convert at the edges.
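
A minimal Python sketch of "store UTC, convert at the edges" follows. The function names, the sample timestamp, and the fixed-offset zones are assumptions for illustration only; real code would more likely use named zones (for example via the zoneinfo module) so that daylight saving rules are applied.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch self-contained; they ignore DST.
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")


def to_storage(local_dt):
    """Normalize a timezone-aware datetime to UTC ISO 8601 text."""
    if local_dt.tzinfo is None:
        raise ValueError("naive datetime: the zone is unknown")
    return local_dt.astimezone(timezone.utc).isoformat()


def for_display(stored, tz):
    """Convert a stored UTC value to the viewer's zone at the edge."""
    return datetime.fromisoformat(stored).astimezone(tz)


stored = to_storage(datetime(2013, 11, 21, 7, 50, tzinfo=PST))
shown = for_display(stored, CET)

print(stored)  # one unambiguous value in the database
print(shown)   # the same instant, rendered for a different viewer
```

Rejecting naive datetimes at the storage boundary is a design choice in this sketch: it forces the caller to say which zone a local time came from before it is written.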

wunder

On Nov 21, 2013, at 7:50 AM, "Jack Krupansky" <j...@basetechnology.com> wrote:

> "Would you store "a" as "&#65;" ?"
> 
> No, not in any case.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Michael Sokolov
> Sent: Thursday, November 21, 2013 8:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
> 
> I have to agree w/Walter.  Use unicode as a storage format.  The entity
> encodings are for transfer/interchange.  Encode/decode on the way in and
> out if you have to.  Would you store "a" as "&#65;" ?  It makes it
> impossible to search for, for one thing.  What if someone wants to
> search for the TM character?
> 
> -Mike
> 
> On 11/20/13 12:07 PM, Jack Krupansky wrote:
>> AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for 
>> storing text to be rendered. If you disagree - try explaining yourself.
>> 
>> But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
>> entities.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Walter Underwood
>> Sent: Wednesday, November 20, 2013 11:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>> 
>> Again, I'd like to know why this is wanted. It sounds like an X-Y problem. 
>> Storing Unicode characters as XML/HTML encoded character references is an 
>> extremely bad idea.
>> 
>> wunder
>> 
>> On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" <j...@basetechnology.com> 
>> wrote:
>> 
>>> Any analysis filtering affects the indexed value only, but the stored value 
>>> would be unchanged from the original input value. An update processor lets 
>>> you modify the original input value that will be stored.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Uwe Reh
>>> Sent: Wednesday, November 20, 2013 5:43 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>>> 
>>> What about having a simple charfilter in the analyzer queue for
>>> indexing *and* searching, e.g.
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™"
>>> replacement="&amp;#8482;" />
>>> or
>>> <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-specials.txt" />
>>> 
>>> Uwe
>>> 
>>> Am 19.11.2013 23:46, schrieb Developer:
>>>> I have a data coming in to SOLR as below.
>>>> 
>>>> <field name="displayName">X™ - Black</field>
>>>> 
>>>> I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
>>>> in SOLR rather than storing the original value.
>>>> 
>>>> Is there a way to do this?
>>> 
>> 
>> -- 
>> Walter Underwood
>> wun...@wunderwood.org
>> 
>> 
> 

--
Walter Underwood
wun...@wunderwood.org


