simplified [is] better thought of as abbreviated
Part of this is a terminological argument. The historical situation is indeed more complicated than many people know, but it is also true that, irrespective of (e.g.) people's past or present usage in handwriting, there have been (in the past and especially in the present) printing traditions which you can pinpoint by political region and time, occasionally by publisher. Regardless of what exactly happened during the pre-simplification era, there are fairly stable traditions now.

[quote approximate and adapted:]
a ["]fully simplified["] passage of text will contain[] both simplified characters and those which have not been simplified [...] and therefore [be] tagged as traditional.
This depends on the algorithm used for tagging. And note that tagging doesn't in fact have to be a /binary/ classifier.†
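To make the non-binary point concrete, here is a minimal sketch of a tagger that reports a label plus a confidence score instead of a hard simplified/traditional decision. The two character sets are tiny illustrative samples I picked, not a complete inventory, and the scoring rule is invented for illustration.

```python
# Characters that occur only in one script; toy samples, not exhaustive.
SIMPLIFIED_ONLY = set("发对书门马")
TRADITIONAL_ONLY = set("發對書門馬")

def tag(text):
    """Return (label, confidence) rather than a binary class."""
    s = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    t = sum(1 for ch in text if ch in TRADITIONAL_ONLY)
    if s + t == 0:
        # No distinguishing characters at all: the text is script-neutral.
        return ("neutral", 0.0)
    label = "simplified" if s >= t else "traditional"
    return (label, max(s, t) / (s + t))
```

A "fully simplified" passage that happens to contain mostly unsimplified characters then simply comes out as "simplified" with low confidence, rather than being misfiled as traditional.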

working at the character level is not the best way to go for your purposes; larger units such as words or phrases produce much more meaningful results, as this mimics the way a person reads Chinese: they do not process one character at a time but rather word by word.
I don't think JohnB was suggesting character-based retrieval. (I mean, who in his right mind would want to do letter-based (and post–case folding) retrieval for English documents? :-) Okay – just a joke, this analogy isn't any good.) But of course you're right to point out that simplification or the reverse operation (what's the term for that? "T-conversion" maybe?) is word- and context-dependent on the edges.
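As an illustration of why the reverse conversion is word-dependent: 发 corresponds to 發 in 发现 ("discover") but to 髮 in 头发 ("hair"), so a per-character table cannot get both right. A minimal greedy longest-match sketch, with a toy word table I made up for the example:

```python
# Toy simplified-to-traditional word table; a real one would be large.
WORDS = {
    "头发": "頭髮",  # hair
    "发现": "發現",  # discover
    "头": "頭",
    "发": "發",      # per-character fallback guess; wrong for "hair"
}
MAXLEN = max(map(len, WORDS))

def s2t(text):
    """Convert via greedy longest match against the word table."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAXLEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORDS:
                out.append(WORDS[chunk])
                i += n
                break
        else:
            out.append(text[i])  # unknown character passes through
            i += 1
    return "".join(out)
```

Longest match is itself only a heuristic, of course; real converters also need context beyond the word level for the hard cases.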

A different point: I'm not advocating imprecision, but people are already partly used to it from text converted by those horrible online tools built for that purpose, and for some characters, people won't actually notice.

Whilst the kZVariant field does mean that characters can, and frequently are, transposed
What do you mean by "transposed"? Could you give an example?

it does not tell you when; also, as said above, the probability is that you have ordinary Chinese text written in the mainland style. Folding based on the kZVariant field would either leave things unchanged or, if it did change things, would misspell words; that is, the results would probably be similar in sound or, in some cases, appearance (or be homophones), but would not match any dictionaries.
But if all occurrences of everything you process are folded (folding to lower-case is often done in NLP), this isn't a problem. Again, I'm not recommending this as best practice, I'm just pointing it out.
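The point is the same one that makes lowercasing safe for retrieval: a lossy fold doesn't hurt matching as long as the identical fold is applied on both the indexing side and the query side. A minimal sketch, with an invented three-entry fold table standing in for any variant-folding data:

```python
# Toy many-to-one fold table; a stand-in for a real variant fold.
FOLD = {"發": "发", "髮": "发", "頭": "头"}

def fold(text):
    """Apply the same character fold everywhere, index and query alike."""
    return "".join(FOLD.get(ch, ch) for ch in text)

# Documents are folded at index time...
index = {fold("頭髮")}
# ...and the query is folded identically at search time.
query = fold("头发")
print(query in index)
```

The folded forms may well be "misspellings" in isolation, but since no unfolded form ever reaches the index or the matcher, that never surfaces.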

There are Chinese compatibility characters in Unicode which, if present, it would probably be good to fold in, but these are not in the scope of UniHan.
And you remind me that z-variation is locale-dependent (see also † above). Anyways, I think it's hard to find examples of meaning-divergent z-variant words in modern Mandarin (MSM). I'm sure you or someone else will be able to quickly dig out examples, but really the question is what set of algorithms and data structures is best to address the general situation. Have locale-dependent folding tables? Allow a search term prefix that specifies "don't normalize or fold the following term"? Have secondary filters in your search that use a stricter model of character identity?
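Two of those options can be sketched together: per-locale fold tables, plus a query-term prefix that bypasses normalization entirely. Both the tables and the "=" prefix convention below are invented for illustration, not taken from any real search engine.

```python
# Per-locale fold tables; toy entries for illustration only.
FOLD_TABLES = {
    "zh-CN": {"發": "发", "髮": "发", "頭": "头"},
    "zh-TW": {},  # e.g. fold nothing for a traditional-script locale
}

def fold(text, locale):
    table = FOLD_TABLES.get(locale, {})
    return "".join(table.get(ch, ch) for ch in text)

def parse_term(term, locale):
    """'=' prefix (invented here) means: match this term verbatim."""
    if term.startswith("="):
        return term[1:]          # skip normalization and folding
    return fold(term, locale)
```

A stricter secondary filter would then simply re-check candidate hits using unfolded character identity, the same way a case-sensitive post-filter works on a case-folded index.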

Stephan
