simplified [is] better thought of as abbreviated
Part of this is a terminological argument. The historical situation is indeed more complicated than many people know, but it is also true that, irrespective of (e.g.) people's past or present usage in handwriting, there have been (in the past and especially in the present) printing traditions which you can pinpoint by political region and time, occasionally by publisher. Regardless of what exactly happened during the pre-simplification era, there are fairly stable traditions now.

[quote approximate and adapted:]
a ["]fully simplified["] passage of text will contain[] both simplified characters and those which have not been simplified [...] and therefore [be] tagged as traditional.
This depends on the algorithm used for tagging. And note that tagging doesn't in fact have to be a /binary/ classifier.†
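To make the non-binary point concrete, here is a minimal sketch of a tagger that reports a label plus a confidence score instead of a hard simplified/traditional decision. The two character sets are tiny illustrative samples I picked, not a complete inventory, and the scoring rule is invented for illustration.

```python
# Characters that occur only in one script; toy samples, not exhaustive.
SIMPLIFIED_ONLY = set("发对书门马")
TRADITIONAL_ONLY = set("發對書門馬")

def tag(text):
    """Return (label, confidence) rather than a binary class."""
    s = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    t = sum(1 for ch in text if ch in TRADITIONAL_ONLY)
    if s + t == 0:
        # No distinguishing characters at all: the text is script-neutral.
        return ("neutral", 0.0)
    label = "simplified" if s >= t else "traditional"
    return (label, max(s, t) / (s + t))
```

A "fully simplified" passage that happens to contain mostly unsimplified characters then simply comes out as "simplified" with low confidence, rather than being misfiled as traditional.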

working at the character level is not the best way to go for your purposes; larger units such as words or phrases produce much more meaningful results, as this mimics the way a person reads Chinese: they do not process one character at a time but rather word by word.
I don't think JohnB was suggesting character-based retrieval. (I mean, who in his right mind would want to do letter-based (and post–case folding) retrieval for English documents? :-) Okay – just a joke, this analogy isn't any good.) But of course you're right to point out that simplification or the reverse operation (what's the term for that? "T-conversion" maybe?) is word- and context-dependent on the edges.
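As an illustration of why the reverse conversion is word-dependent: 发 corresponds to 發 in 发现 ("discover") but to 髮 in 头发 ("hair"), so a per-character table cannot get both right. A minimal greedy longest-match sketch, with a toy word table I made up for the example:

```python
# Toy simplified-to-traditional word table; a real one would be large.
WORDS = {
    "头发": "頭髮",  # hair
    "发现": "發現",  # discover
    "头": "頭",
    "发": "發",      # per-character fallback guess; wrong for "hair"
}
MAXLEN = max(map(len, WORDS))

def s2t(text):
    """Convert via greedy longest match against the word table."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAXLEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORDS:
                out.append(WORDS[chunk])
                i += n
                break
        else:
            out.append(text[i])  # unknown character passes through
            i += 1
    return "".join(out)
```

Longest match is itself only a heuristic, of course; real converters also need context beyond the word level for the hard cases.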

A different point: I'm not advocating imprecision, but people are already partly used to it from text converted by those horrible online tools built for that purpose, and for some characters, people won't actually notice.

Whilst the kZVariant field does mean that characters can, and frequently are, transposed
What do you mean by "transposed"? Could you give an example?

it does not tell you when; also, as said above, the probability is that you have ordinary Chinese text written in the mainland style. Folding based on the kZVariant field would either leave things unchanged or, if it did change things, would misspell words; that is, the results would probably be similar in sound or, in some cases, appearance (or be homophones), but would not match any dictionaries.
But if all occurrences of everything you process are folded (folding to lower-case is often done in NLP), this isn't a problem. Again, I'm not recommending this as best practice, I'm just pointing it out.
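The point is the same one that makes lowercasing safe for retrieval: a lossy fold doesn't hurt matching as long as the identical fold is applied on both the indexing side and the query side. A minimal sketch, with an invented three-entry fold table standing in for any variant-folding data:

```python
# Toy many-to-one fold table; a stand-in for a real variant fold.
FOLD = {"發": "发", "髮": "发", "頭": "头"}

def fold(text):
    """Apply the same character fold everywhere, index and query alike."""
    return "".join(FOLD.get(ch, ch) for ch in text)

# Documents are folded at index time...
index = {fold("頭髮")}
# ...and the query is folded identically at search time.
query = fold("头发")
print(query in index)
```

The folded forms may well be "misspellings" in isolation, but since no unfolded form ever reaches the index or the matcher, that never surfaces.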

There are Chinese compatibility characters in Unicode which, if present, it would probably be good to fold in, but these are not in the scope of UniHan.
And you remind me that z-variation is locale-dependent (see also † above). Anyways, I think it's hard to find examples of meaning-divergent z-variant words in modern Mandarin (MSM). I'm sure you or someone else will be able to quickly dig out examples, but really the question is what set of algorithms and data structures is best to address the general situation. Have locale-dependent folding tables? Allow a search term prefix that specifies "don't normalize or fold the following term"? Have secondary filters in your search that use a stricter model of character identity?
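Two of those options can be sketched together: per-locale fold tables, plus a query-term prefix that bypasses normalization entirely. Both the tables and the "=" prefix convention below are invented for illustration, not taken from any real search engine.

```python
# Per-locale fold tables; toy entries for illustration only.
FOLD_TABLES = {
    "zh-CN": {"發": "发", "髮": "发", "頭": "头"},
    "zh-TW": {},  # e.g. fold nothing for a traditional-script locale
}

def fold(text, locale):
    table = FOLD_TABLES.get(locale, {})
    return "".join(table.get(ch, ch) for ch in text)

def parse_term(term, locale):
    """'=' prefix (invented here) means: match this term verbatim."""
    if term.startswith("="):
        return term[1:]          # skip normalization and folding
    return fold(term, locale)
```

A stricter secondary filter would then simply re-check candidate hits using unfolded character identity, the same way a case-sensitive post-filter works on a case-folded index.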

Stephan
