I'm trying to create a test to make sure that character sequences like
&egrave; are successfully converted to their equivalent UTF-8
character (that is, in this case, è).
So, I'd like to search my Solr index using the equivalent of the
following regular expression:
&\w{1,6};
To find any escaped
StandardTokenizer will have stripped the punctuation, I think. You might
try searching for all the entity names, though:
(agrave | egrave | omacron | etc... )
The names are pretty distinctive, although you might have problems with
Greek letters.
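
A rough sketch of both checks, in Python, in case it helps. The field
name "text" is only a placeholder for whatever field your schema indexes,
and the helper names are made up for illustration. The entity table comes
from Python's standard library and covers the HTML 4 names only. The
regex is the one from the question (assuming a leading &), and it only
makes sense against the raw source text, before any tokenizer has run.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# Placeholder field name (an assumption); use the field your schema indexes.
FIELD = "text"

# Pattern from the question, assuming a leading '&'. Apply it to raw text only,
# since a tokenizer that strips punctuation would remove the '&' and ';'.
ENTITY_RE = re.compile(r"&\w{1,6};")


def entity_query(names, field=FIELD):
    """Build a Lucene-syntax query that ORs together bare entity names."""
    return f"{field}:({' OR '.join(sorted(names))})"


def find_escaped(raw):
    """Return every entity-like sequence still present in raw text."""
    return ENTITY_RE.findall(raw)


if __name__ == "__main__":
    # A short hand-picked list; sorted(name2codepoint) would give all HTML 4 names.
    print(entity_query(["egrave", "agrave"]))
    print(find_escaped("cr&egrave;me and caf&eacute;"))
```

Any hit on the generated query points at a document where an entity
slipped through unconverted. Names that are also ordinary words will
produce false positives, so those hits need a second look.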
-Mike
On 04/28/2011 12:10 PM, Paul wrote: