Lars,
Tuesday, January 7, 2003, 11:28:50 AM, you wrote:
DC>> So, the change to the algorithm would simply be to preserve all
DC>> spacing.
...
Lars> The problem is that a programme can't distinguish between the different
Lars> kinds of periods used in texts. That is why any algorithm can only
Lars> insert *one* word separator after every period it encounters, because it
Lars> doesn't know anything about how the period is used there.
Please note that I said *preserve*. That is quite different from *add*.
When the user typed the text in, in its original form, they added spacing,
including one or two spaces after a period, as they saw fit. Hence the
formatting software merely needs to retain the existing spaces, whatever
they are.
Yes, there are times when the software might need to add spaces, such as when
there is no space after the last word on an original line, and that word
needs to be moved into another line.
For situations in which a space must be added, the algorithm could be simple
-- just add one space -- or it could apply a heuristic and sometimes add
two. My own experience with such heuristics is that they usually do not work
very well. There is a huge amount of case analysis that is needed to make
the heuristic work even moderately well.
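
To make the suggestion concrete, here is a rough Python sketch of the
"preserve, and add one space only when forced to" approach. It is an
illustration only, not how any particular mail client does it. The function
name rewrap and its width parameter are made up for this example, and it
assumes the input is a single paragraph.

```python
import re

def rewrap(text: str, width: int) -> list[str]:
    """Re-wrap one paragraph, keeping the spacing the author typed."""
    # Join the original lines. Add a single space only where a line
    # break was the sole separator between two words.
    joined = ""
    for line in text.splitlines():
        if joined and not joined[-1].isspace() and line and not line[0].isspace():
            joined += " "
        joined += line

    # Split into (word, following whitespace) pairs so that the
    # original gaps, one space or two, travel with the words.
    tokens = re.findall(r"(\S+)(\s*)", joined)

    lines, current, pending = [], "", ""
    for word, gap in tokens:
        if current and len(current) + len(pending) + len(word) > width:
            # Start a new line; the gap at the break point is dropped.
            lines.append(current)
            current = word
        else:
            current += pending + word
        pending = gap
    if current:
        lines.append(current)
    return lines
```

Note that nothing here looks at periods at all. A gap of two spaces stays two
spaces, a gap of one stays one, and the only space ever invented is the
single one that replaces an original line break.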
The suggestion that it is "merely" a matter of having a collection of
abbreviations is natural but misleading. First, getting a really thorough
list is difficult in just one language. What about several? Second, the list
can give the wrong answer. How do you do spacing after the period in the
following text:
An excellent performer would be Yo Yo Ma.
Note that MA. is an abbreviation for Master of Arts. And lest you note the
case difference on the A, what if there were a spelling error? Making
semantic choices based on capitalization is yet another dangerous
path to travel.
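
A small Python sketch of the abbreviation-list heuristic shows the failure.
The list contents, the function name spaces_after_period, and the ignore_case
flag are all invented for this illustration.

```python
# Hypothetical and deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "ma."}

def spaces_after_period(word: str, ignore_case: bool = True) -> int:
    """Guess how many spaces should follow a word, using the list."""
    if not word.endswith("."):
        return 1
    key = word.lower() if ignore_case else word
    known = ABBREVIATIONS if ignore_case else {"Dr.", "Mr.", "e.g.", "MA."}
    # One space after a known abbreviation, two after a sentence end.
    return 1 if key in known else 2

# "Ma." ends the sentence, but the case-insensitive lookup treats it
# as the abbreviation and asks for only one space.
print(spaces_after_period("Ma."))
# The case-sensitive lookup gets "Ma." right, but it then misses the
# abbreviation whenever it is typed as "Ma." by mistake.
print(spaces_after_period("Ma.", ignore_case=False))
```

Either way the list has to guess, and either guess is wrong for some text.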
Personally, I think the simple algorithm I am suggesting will work just fine.
d/
--
Dave <mailto:[EMAIL PROTECTED]>
Brandenburg InternetWorking <http://www.brandenburg.com>
t +1.408.246.8253; f +1.408.850.1850
________________________________________________
Current version is 1.62 | "Using TBUDL" information:
http://www.silverstones.com/thebat/TBUDLInfo.html