RE: How to handle searches across traditional and simplified Chinese?
This page discusses why the mapping is not a simple one-to-one: http://www.kanji.org/cjk/c2c/c2cbasis.htm

Tom

-----Original Message-----
I have documents that contain both simplified and traditional Chinese characters. Is there any way to search across them? For example, if someone searches for 类 (simplified Chinese), I'd like to be able to recognize that the equivalent character is 類 in traditional Chinese and search for 类 or 類 in the documents.
Re: How to handle searches across traditional and simplified Chinese?
I did a little research into this for a client a while ago. The character mapping is not one-to-one, which complicates things (TC and SC have evolved independently), and if you want to do a perfect job you will need a dictionary.

However, there are tables out there (I can dig one up for you) that allow conversion from one to the other. So you would pick either TC or SC as your canonical Chinese, and convert all the documents and searches to it.

I will stress that this is very much a brute-force approach: the mapping is not perfect, and the two character sets have diverged (much like UK and US English; I was brought up in the UK and live in the US).

Hope this helps.

Cheers

François

On Mar 7, 2011, at 5:02 PM, Andy wrote:

> I have documents that contain both simplified and traditional Chinese characters. Is there any way to search across them? For example, if someone searches for 类 (simplified Chinese), I'd like to be able to recognize that the equivalent character is 類 in traditional Chinese and search for 类 or 類 in the documents.
>
> Is that something that Solr, or any related software, can do? Is there a standard approach in dealing with this problem?
>
> Thanks.
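For illustration, the canonical-form approach described above might be sketched roughly as follows. This is only a minimal sketch: the three-entry table, the choice of Simplified as the canonical form, and the names `TRAD_TO_SIMP`, `to_canonical` and `matches` are all assumptions made for the example, not part of Solr or of any published conversion table.

```python
# Hypothetical, deliberately tiny Traditional -> Simplified table.
# A real table would be loaded from a published conversion resource
# and would contain thousands of entries.
TRAD_TO_SIMP = {
    "類": "类",
    "發": "发",  # two different traditional characters fold into the same
    "髮": "发",  # simplified one, so the mapping cannot be reversed exactly
}

# str.translate expects a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(TRAD_TO_SIMP)


def to_canonical(text: str) -> str:
    """Fold text to the canonical (here: Simplified) form, one character at a time."""
    return text.translate(_TABLE)


def matches(query: str, document: str) -> bool:
    """Naive substring search after folding both sides to the canonical form."""
    return to_canonical(query) in to_canonical(document)


if __name__ == "__main__":
    # The same folding is applied at "index time" and at "query time".
    print(matches("类", "分類"))
    print(matches("類", "分类"))
```

Folding many-to-one in this direction trades precision for recall: a search for 髮 would also match documents containing 發, which is the imperfection the brute-force caveat refers to.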
Re: How to handle searches across traditional and simplified Chinese?
Thanks. Please tell me more about the tables/software that does the conversion. Really appreciate your help.

--- On Mon, 3/7/11, François Schiettecatte fschietteca...@gmail.com wrote:

From: François Schiettecatte fschietteca...@gmail.com
Subject: Re: How to handle searches across traditional and simplified Chinese?
To: solr-user@lucene.apache.org
Date: Monday, March 7, 2011, 5:24 PM
Re: How to handle searches across traditional and simplified Chinese?
Here are a bunch of resources that will help:

This does TC <=> SC conversions:
http://search.cpan.org/~audreyt/Encode-HanConvert-0.35/lib/Encode/HanConvert.pm

This has a TC <=> SC converter in there somewhere:
http://www.mediawiki.org/wiki/MediaWiki

This explains some of the issues behind TC <=> SC conversions:
http://people.w3.org/rishida/scripts/chinese/

Misc tools:
http://mandarintools.com/

François
Re: How to handle searches across traditional and simplified Chinese?
On Mon, Mar 7, 2011 at 7:01 PM, Andy angelf...@yahoo.com wrote:

> Thanks. Please tell me more about the tables/software that does the conversion. Really appreciate your help.

Also, you might be interested in this example:

<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTransformFilterFactory