RE: How to handle searches across traditional and simplified Chinese?

2011-03-08 Thread Burton-West, Tom
This page discusses why the mapping is not a simple one-to-one:

http://www.kanji.org/cjk/c2c/c2cbasis.htm

Tom
-----Original Message-----
 I have documents that contain both simplified and traditional Chinese 
 characters. Is there any way to search across them? For example, if someone 
 searches for 类 (simplified Chinese), I'd like to be able to recognize that 
 the equivalent character is 類 in traditional Chinese and search for 类 or 類 in 
 the documents.


Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread François Schiettecatte
I did a little research into this for a client a while ago. The character 
mapping is not one-to-one, which complicates things (TC and SC have evolved 
independently), and if you want to do a perfect job you will need a dictionary. 
However, there are tables out there (I can dig one up for you) that allow 
conversion from one to the other. So you would pick either TC or SC as your 
canonical Chinese and convert all the documents and searches to it.

I will stress that this is very much a brute-force approach: the mapping is not 
perfect, and the two character sets have evolved separately (much like UK and 
US English; I was brought up in the UK and live in the US).
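
A minimal sketch of the canonical-form idea described above, in Python. The 
three-entry table T2S and the helper names to_canonical and matches are made up 
for illustration; a real conversion table has thousands of entries and would 
come from one of the published resources. Simplified is chosen as the canonical 
form here only because several traditional characters can fold onto one 
simplified character, so the table can stay a plain character-to-character map.

```python
# Hypothetical, hand-written sample of a traditional -> simplified table.
# A real table is far larger; these three entries are for illustration only.
T2S = {
    "類": "类",
    "發": "发",  # "to emit, to send out"
    "髮": "发",  # "hair" (a different word that shares the simplified form)
}

# str.translate needs a table keyed by code point; maketrans builds it.
_T2S_TABLE = str.maketrans(T2S)


def to_canonical(text: str) -> str:
    """Fold known traditional characters to simplified; leave others as-is."""
    return text.translate(_T2S_TABLE)


def matches(query: str, document: str) -> bool:
    """Substring match after converting both sides to the canonical form."""
    return to_canonical(query) in to_canonical(document)


docs = ["分類方法", "分类方法", "頭髮", "發現"]

# A simplified or a traditional query finds both spellings.
hits = [d for d in docs if matches("类", d)]
print(hits)

# The brute-force cost: 发 now matches both 髮 ("hair") and 發 ("emit").
print([d for d in docs if matches("发", d)])
```

The second query shows why this is only an approximation: once 發 and 髮 are 
folded together, a search cannot tell them apart without a dictionary.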

Hope this helps.

Cheers

François

On Mar 7, 2011, at 5:02 PM, Andy wrote:

 I have documents that contain both simplified and traditional Chinese 
 characters. Is there any way to search across them? For example, if someone 
 searches for 类 (simplified Chinese), I'd like to be able to recognize that 
 the equivalent character is 類 in traditional Chinese and search for 类 or 類 in 
 the documents. 
 
 Is that something that Solr, or any related software, can do? Is there a 
 standard approach in dealing with this problem?
 
 Thanks.
 
 
 



Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Andy
Thanks. Please tell me more about the tables/software that does the conversion. 
Really appreciate your help.







Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread François Schiettecatte
Here are a bunch of resources which will help:


This converts between TC and SC:


http://search.cpan.org/~audreyt/Encode-HanConvert-0.35/lib/Encode/HanConvert.pm


This has a TC/SC converter in there somewhere:

http://www.mediawiki.org/wiki/MediaWiki


This explains some of the issues behind converting between TC and SC:

http://people.w3.org/rishida/scripts/chinese/


Misc tools:

http://mandarintools.com/


François





Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Robert Muir
On Mon, Mar 7, 2011 at 7:01 PM, Andy angelf...@yahoo.com wrote:
 Thanks. Please tell me more about the tables/software that does the 
 conversion. Really appreciate your help.


You might also be interested in this example:

<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTransformFilterFactory