rio wrote:
> I'm developing an application to do interlineal (an extreme type of
> literal) translations of natural language texts and xml. Here's an
> example of a text:
>
> '''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''
>
> and the expected translation, with all of the original tags, whitespace,
> etc. intact:
>
> '''For that are the friends. For toCelebrate <i>the graces</i> ofThe
> other.<p>'''
>
> I was unable to find (in htmlparser, string or unicode) a way to define
> words as a series of letters (including non-ascii character sets) outside
> of an xml tag, whitespace and punctuation, so I wrote the code below to
> create a list of the words, nonwords, and xml tags in a text. My
> intuition tells me that it's an awful lot of code to do a simple thing,
> but it's the best I could come up with. I foresee several problems:
>
> - It currently requires that the entire string (or file) be read into
>   memory. If I wanted to process a large file line by line, a tag which
>   spans more than one line would be ignored. (That's assuming I would
>   not be able to store state information in the function, which is
>   something I've not yet learned how to do.)
> - HTML comments may not be supported. (I'm not really sure about this.)
> - It may be very slow, as it indexes into the string instead of
>   iterating over it.
>
> What can I do to overcome these issues? Am I reinventing the wheel?
> Should I be using re?
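Not code from the thread, but a minimal sketch of the re approach rio asks
about at the end, assuming a Python 2-era stdlib to match the rest of the
thread; the tokenize() name and the exact patterns are illustrative:

import re

# Try an xml tag first, then a run of word characters, then any single
# other character (whitespace, punctuation).  With unicode input and
# re.UNICODE, accented letters count as word characters; note that \w also
# matches digits and '_' (use [^\W\d_]+ to restrict "word" to letters).
TOKEN = re.compile(r'(?P<tag><[^>]*>)|(?P<word>\w+)|(?P<other>\W)', re.UNICODE)

def tokenize(text):
    """Yield (kind, token) pairs; joining the tokens restores the input."""
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        yield kind, m.group(kind)

if __name__ == '__main__':
    sample = u'Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'
    for kind, token in tokenize(sample):
        print('%s %r' % (kind, token))

This keeps everything in memory, though, so it has the same large-file
limitation the post describes.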
You should probably be using sgmllib. Here is an example that is pretty
close to what you are doing:
http://diveintopython.org/html_processing/index.html

Kent
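Not Kent's or rio's code, but a rough sketch of what that sgmllib approach
could look like, modelled loosely on the BaseHTMLProcessor example in Dive
Into Python (Python 2 only, since sgmllib is gone in Python 3; the
WordLister name and the token scheme are illustrative):

import re
import sgmllib

class WordLister(sgmllib.SGMLParser):
    """Collect ('tag'|'word'|'other', text) tokens from marked-up text."""

    WORD = re.compile(r'(\w+)', re.UNICODE)

    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.tokens = []

    # Markup is passed through as 'tag' tokens.  The reconstruction is only
    # approximate: sgmllib lowercases tag names and renormalizes attributes.
    def unknown_starttag(self, tag, attrs):
        strattrs = ''.join([' %s="%s"' % (k, v) for k, v in attrs])
        self.tokens.append(('tag', '<%s%s>' % (tag, strattrs)))

    def unknown_endtag(self, tag):
        self.tokens.append(('tag', '</%s>' % tag))

    def handle_comment(self, data):
        self.tokens.append(('tag', '<!--%s-->' % data))

    def handle_entityref(self, ref):
        self.tokens.append(('other', '&%s;' % ref))

    def handle_charref(self, ref):
        self.tokens.append(('other', '&#%s;' % ref))

    # Text between tags is split into words and the whitespace/punctuation
    # around them.  feed() can be called repeatedly (e.g. line by line);
    # the parser buffers a tag that is split across calls.
    def handle_data(self, data):
        for i, piece in enumerate(self.WORD.split(data)):
            if not piece:
                continue
            if i % 2:
                self.tokens.append(('word', piece))
            else:
                self.tokens.append(('other', piece))

if __name__ == '__main__':
    parser = WordLister()
    parser.feed('Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.')
    parser.close()
    for kind, token in parser.tokens:
        print '%s %r' % (kind, token)

Since the parser buffers its own input and parses comments itself, this
would also cover the multi-line tag and html comment concerns from the
original post.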