As far as general folding is concerned: performing conversion (word-based or not, locale-tailored or not) followed by a strict search will make you miss the z-variation you find in the wild, whether it stems from true variation or from misspellings. A more generous inclusion of z-variation, on the other hand, is unlikely to give you false matches: different words normally don't differ merely on the z-axis, though I seem to remember having seen an example somewhere involving the name of a historical term.
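
To make the contrast with strict search concrete, here is a minimal sketch of such generous folding. The variant groups are a tiny hand-picked illustration, not real variant data, and the names VARIANT_GROUPS, fold and generous_match are made up for this example.

```python
# Hypothetical, hand-picked variant groups; a real table would be far larger
# and would come from whatever variant data one decides to trust.
VARIANT_GROUPS = [
    "发發髮",
    "头頭",
    "后後",
    "干乾幹",
]

# Map every character to the first member of its group (the fold target).
FOLD = {ch: group[0] for group in VARIANT_GROUPS for ch in group}


def fold(text: str) -> str:
    """Replace each character by the representative of its variant group."""
    return "".join(FOLD.get(ch, ch) for ch in text)


def generous_match(query: str, document: str) -> bool:
    """Substring search after folding both sides onto the same representatives."""
    return fold(query) in fold(document)
```

With this table a strict search for "理发" in "他去理髮了" fails, while generous_match finds it, because both sides are folded before comparison.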

You are right about this point
> My point here was that folding based on a character-by-character traditional-to-simplified model would make accurate word-based retrieval from the resulting text harder, not easier.
and the note on "transposition". But I also don't think this is the end of the story: if you strictly convert on a word level, you will miss those search results where your contextual conversion heuristic was wrong (note that this point is different from what's in my first paragraph above). Perhaps a Classical Chinese character collocation agrees with a modern Chinese term in simplified spelling but should be converted "directly" instead of transposed when going from CN to TW. So for that you'd need some sort of n-way expansion of a search query. I don't have an example off the top of my head, but I don't think the scenario is unrealistic at all.
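
A rough sketch of what I mean by n-way expansion, assuming a one-to-many character table and a word-level transposition table. Both tables, and the names CHAR_CANDIDATES, WORD_TRANSPOSITIONS, direct_candidates and expand_query, are hypothetical and only serve the illustration.

```python
from itertools import product

# Hypothetical one-to-many table: simplified character -> traditional candidates.
CHAR_CANDIDATES = {
    "软": ["軟"],
    "头": ["頭"],
    "发": ["發", "髮"],
}

# Hypothetical word-level table: CN term -> TW term ("transposition").
WORD_TRANSPOSITIONS = {
    "软件": "軟體",
}


def direct_candidates(query: str) -> set[str]:
    """All character-by-character conversions of the query."""
    options = [CHAR_CANDIDATES.get(ch, [ch]) for ch in query]
    return {"".join(combo) for combo in product(*options)}


def expand_query(query: str) -> set[str]:
    """Union of the direct conversions and the word-level transposed forms."""
    variants = direct_candidates(query)
    for cn_term, tw_term in WORD_TRANSPOSITIONS.items():
        if cn_term in query:
            variants |= direct_candidates(query.replace(cn_term, tw_term))
    return variants
```

Here expand_query("软件") yields both "軟件" (direct) and "軟體" (transposed), so a wrong guess by the conversion heuristic no longer hides results. The price is that the number of variants grows multiplicatively with the number of ambiguous characters.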

Stephan

