Re: DIH import from MySQL results in garbage text for special chars

Pranav Prakash Wed, 26 Sep 2012 03:17:47 -0700

I looked at the HEX codes of the texts. The hex code in MySQL is different
from that which is stored in the index.


The hex code in index is longer than the hex code in MySQL, this leads me
to the fact that somewhere in between smething is messing up,

*Pranav Prakash*

"temet nosce"



On Fri, Sep 21, 2012 at 11:19 AM, Pranav Prakash <pra...@gmail.com> wrote:

> I am seeing the garbage text in browser, Luke Index Toolbox and everywhere
> it is the same. My servlet container is Jetty which is the out-of-box one.
> Many other special chars are getting indexed and stored properly, only few
> characters causes pain.
>
> *Pranav Prakash*
>
> "temet nosce"
>
>
>
>
> On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson 
> <erickerick...@gmail.com>wrote:
>
>> Is your _browser_ set to handle the appropriate character set? Or whatever
>> you're using to inspect your data? How about your servlet container?
>>
>>
>>
>> Best
>> Erick
>>
>> On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash <pra...@gmail.com> wrote:
>> > Hi Folks,
>> >
>> > I am attempting to import documents to Solr from MySQL using DIH. One of
>> > the field contains the text - “Future of Mobile Value Added Services
>> (VAS)
>> > in Australia” .Notice the character “ and ”.
>> >
>> > When I am importing, it gets stored as - â€œFuture of Mobile Value Added
>> > Services (VAS) in Australiaâ€�.
>> >
>> > The datasource config clearly mentions use of UTF-8 as follows:
>> >
>> >   <dataSource type="JdbcDataSource"
>> >     driver="com.mysql.jdbc.Driver"
>> >     url="jdbc:mysql://localhost/ohapp_devel"
>> >     user="username"
>> >     useUnicode="true"
>> >     characterEncoding="UTF-8"
>> >     password="password"
>> >     zeroDateTimeBehavior="convertToNull"
>> >     name="app" />
>> >
>> >
>> > A plain SQL Select statement on the MySQL Console gives appropriate
>> text. I
>> > even tried using following scriptTransformer to get rid of this char,
>> but
>> > it was of no particular use in my case.
>> >
>> > function gsub(source, pattern, replacement) {
>> >       var match, result;
>> >       if (!((pattern != null) && (replacement != null))) {
>> >         return source;
>> >       }
>> >       result = '';
>> >       while (source.length > 0) {
>> >         if ((match = source.match(pattern))) {
>> >           result += source.slice(0, match.index);
>> >           result += replacement;
>> >           source = source.slice(match.index + match[0].length);
>> >         } else {
>> >           result += source;
>> >           source = '';
>> >         }
>> >       }
>> >       return result;
>> >     }
>> >
>> >     function fixQuotes(c){
>> >       c = gsub(c, /\342\200(?:\234|\235)/,'"');
>> >       c = gsub(c, /\342\200(?:\230|\231)/,"'");
>> >       c = gsub(c, /\342\200\223/,"-");
>> >       c = gsub(c, /\342\200\246/,"...");
>> >       c = gsub(c, /\303\242\342\202\254\342\204\242/,"'");
>> >       c = gsub(c, /\303\242\342\202\254\302\235/,'"');
>> >       c = gsub(c, /\303\242\342\202\254\305\223/,'"');
>> >       c = gsub(c, /\303\242\342\202\254"/,'-');
>> >       c = gsub(c, /\342\202\254\313\234/,'"');
>> >       c = gsub(c, /“/, '"');
>> >       return c;
>> >     }
>> >
>> > function cleanFields(row){
>> >       var fieldsToClean = ['title', 'description'];
>> >       for(i =0; i< fieldsToClean.length; i++){
>> >         var old_text = String(row.get(fieldsToClean[i]));
>> >         row.put(fieldsToClean[i], fixQuotes(old_text) );
>> >       }
>> >       return row;
>> >     }
>> >
>> > My understanding goes that this must be a very common problem. It also
>> > occurs with human names which have these chars. What is an appropriate
>> way
>> > to get the appropriate text indexed and searchable? The fieldtype where
>> > this is stored goes as follows
>> >
>> >   <fieldType name="text_commongrams" class="solr.TextField">
>> >     <analyzer>
>> >       <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >       <tokenizer class="solr.StandardTokenizerFactory" />
>> >       <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>> >       <filter class="solr.TrimFilterFactory" />
>> >       <filter class="solr.LowerCaseFilterFactory" />
>> >       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"
>> > protected="protwords.txt"/>
>> >     <filter class="solr.SynonymFilterFactory"
>> >       synonyms="synonyms.txt"
>> >       ignoreCase="true"
>> >       expand="true" />
>> >     <filter class="solr.CommonGramsFilterFactory"
>> >       words="stopwords_en.txt"
>> >       ignoreCase="true" />
>> >     <filter class="solr.StopFilterFactory"
>> >       words="stopwords_en.txt"
>> >       ignoreCase="true" />
>> >     <filter class="solr.WordDelimiterFilterFactory"
>> >       generateWordParts="1"
>> >       generateNumberParts="1"
>> >       catenateWords="1"
>> >       catenateNumbers="1"
>> >       catenateAll="0"
>> >       preserveOriginal="1" />
>> >   </analyzer>
>> > </fieldType>
>> >
>> >
>> > *Pranav Prakash*
>> >
>> > "temet nosce"
>>
>
>

Re: DIH import from MySQL results in garbage text for special chars

Reply via email to