Re: Problem with UTF-8 and Solr ISOLatin1AccentFilterFactory

aerox7 Fri, 20 Mar 2009 05:04:37 -0700

Yes! I completely understand the problem. I'm just asking about your
solution to resolve it.


I guess that you use the Solr Perl client to index your database. In my case
I use DataImportHandler, so the only solution I have is to create a
transformer for DataImportHandler and try to convert my rows from Latin-1 to
UTF-8 (see
http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9).
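
A minimal sketch of what such a transformer might look like in Java. It rests on assumptions: DataImportHandler is assumed to accept any class that exposes a transformRow(Map<String, Object>) method, and the class name Latin1ToUtf8Transformer and the repair helper are hypothetical names chosen for illustration. It undoes the case where UTF-8 bytes were read as ISO-8859-1, which is what a value like "SolÃ¨ne" looks like.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical class name. For real use with Solr it would likely need to be
// declared public in its own file; it is package-private here to keep the
// sketch self-contained.
class Latin1ToUtf8Transformer {

    // Turn each character back into a single ISO-8859-1 byte, then decode
    // the resulting byte sequence as UTF-8.
    // Caution: text that is already correct would be damaged by this step.
    static String repair(String s) {
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Assumed entry point: called once per imported row.
    // Only String values are rewritten; other column types are left alone.
    public Object transformRow(Map<String, Object> row) {
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (e.getValue() instanceof String) {
                e.setValue(repair((String) e.getValue()));
            }
        }
        return row;
    }
}
```

If this approach fits, the entity in db-config.xml would presumably reference the class through its transformer attribute. The sketch should only be applied to columns known to hold double-encoded text.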
 

So I just want to know whether you use DataImportHandler too, with a Perl
script as a transformer?


Óscar Marín Miró wrote:
> 
> What I mean is that unless "solène" travels to Solr in strict UTF-8,
> mapping-ISOLatin1Accent won't do anything, and possibly your DB query
> returns data in ISO-Latin1 (I always have this issue with UTF8-MySQL), so
> unless you transcode your data from Latin1 to UTF8 before sending it to
> Solr, mapping-ISOLatin1Accent won't know how to interpret it.
> 
> Does it make any sense? :P
> 
> On Fri, Mar 20, 2009 at 11:53 AM, aerox7 <amyne.berr...@me.com> wrote:
> 
>>
>> I'm using DataImportHandler to send my data to Solr! So you mean it is
>> possible to apply a transformer in db-config.xml with a Perl script?
>>
>>
>> Óscar Marín Miró wrote:
>> >
>> > Hi,
>> >
>> > My guess is that *although* your DB is in UTF-8, the database engine
>> > sends you the rows in ISO-Latin1, so before doing *anything* after
>> > receiving the data, you should transcode from ISO-Latin1 to UTF-8 and
>> > then send that to Solr. I'm no Java expert, but in Perl (MySQL DB in
>> > UTF-8) I have to do this with every row:
>> >
>> > $row=decode("iso-8859-1",$row);
>> >
>> > ... and before building the XML to invoke and add a document to Solr:
>> >
>> > $row=encode("utf8",$row);
>> >
>> > On Fri, Mar 20, 2009 at 10:55 AM, aerox7 <amyne.berr...@me.com> wrote:
>> >
>> >>
>> >> I added:
>> >> "Ã¨" => "e" to mapping-ISOLatin1Accent.txt
>> >>
>> >> and add the following fieldType:
>> >>
>> >> <fieldType name="textCharNorm" class="solr.TextField"
>> >> positionIncrementGap="100" >
>> >>  <analyzer>
>> >>    <charFilter class="solr.MappingCharFilterFactory"
>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >>    <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
>> >>  </analyzer>
>> >> </fieldType>
>> >>
>> >> But I still have the same problem! It only works when I store the ISO
>> >> string in the UTF-8 database (e.g. store solène, not solÃ¨ne). :,(
>> >>
>> >>
>> >>
>> >>
>> >> aerox7 wrote:
>> >> >
>> >> > ==> where are you seeing it as "SolÃ¨ne" as opposed to the
>> >> > correct way of solène?
>> >> >
>> >> > I have "SolÃ¨ne" in my MySQL database, so I don't know whether this is
>> >> > correct or not. I guess that "SolÃ¨ne" is solène in UTF-8?
>> >> >
>> >> > I've tried the analysis page at
>> >> > http://localhost:8983/solr/admin/analysis.jsp. When I try with solène
>> >> > everything is OK, but when I try with SolÃ¨ne (like what I have in
>> >> > the DB) the analysis converts Ã to A and deletes ¨, so I get SolAne!
>> >> >
>> >> > I think that ISOLatin1AccentFilterFactory only takes strings in the
>> >> > ISO-8859-1 charset.
>> >> >
>> >> > So is there any solution to transform my string to ISO-8859-1 before
>> >> > the indexing process? Maybe by creating a transformer in
>> >> > DataImportHandler? (I have never coded in Java :( )
>> >> >
>> >> > Thank you all.
>> >> >
>> >> >
>> >> > Koji Sekiguchi-2 wrote:
>> >> >>
>> >> >> aerox7 wrote:
>> >> >>> Hi,
>> >> >>> I have a MySQL database in UTF-8. I have a row with "SolÃ¨ne"
>> >> >>> (solène). I want to transform this to solene, so I use Solr's
>> >> >>> ISOLatin1AccentFilterFactory to perform this task, but it doesn't
>> >> >>> work.
>> >> >>>
>> >> >>> I guess that "SolÃ¨ne" is "solène" in UTF-8? I also set Tomcat to
>> >> >>> UTF-8, so normally ISOLatin1AccentFilterFactory should replace the
>> >> >>> accent.
>> >> >>>
>> >> >>> Any ideas?
>> >> >>>
>> >> >>> I use DataImportHandler.
>> >> >>>
>> >> >>
>> >> >> If a mapping rule "Ã¨" to "e" is always true in your field, you can
>> >> >> try to use MappingCharFilter instead of ISOLatin1AccentFilter. Add
>> >> >> the following line to mapping-ISOLatin1Accent.txt:
>> >> >>
>> >> >> "Ã¨" => "e"
>> >> >>
>> >> >> and add the following fieldType:
>> >> >>
>> >> >> <fieldType name="textCharNorm" class="solr.TextField"
>> >> >> positionIncrementGap="100" >
>> >> >>   <analyzer>
>> >> >>     <charFilter class="solr.MappingCharFilterFactory"
>> >> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >> >>     <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
>> >> >>   </analyzer>
>> >> >> </fieldType>
>> >> >>
>> >> >> MappingCharFilter and mapping-ISOLatin1Accent.txt are in the nightly
>> >> >> build.
>> >> >>
>> >> >> Koji
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Problem-with-UTF-8-and-Solr-ISOLatin1AccentFilterFactory-tp22607642p22617278.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>> > --
>> > “I may not believe in myself, but I believe in what I'm doing.”
>> >
>> > -- Jimmy Page
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Problem-with-UTF-8-and-Solr-ISOLatin1AccentFilterFactory-tp22607642p22618085.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Problem-with-UTF-8-and-Solr-ISOLatin1AccentFilterFactory-tp22607642p22618999.html
Sent from the Solr - User mailing list archive at Nabble.com.
