I'm developing an application to do interlinear (an extreme kind of literal) translations of natural-language texts and XML. Here's an example of a text:

'''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.'''

and the expected translation, with all of the original tags, whitespace, etc. intact:

'''For that are the friends. For toCelebrate <i>the graces</i> ofThe other.'''

I was unable to find (in HTMLParser, string, or unicode) a way to define words as a series of letters (including non-ASCII character sets) outside of an XML tag, as opposed to whitespace/punctuation, so I wrote the code below to build a list of the words, nonwords, and XML tags in a text. My intuition tells me it's an awful lot of code to do a simple thing, but it's the best I could come up with.

I foresee several problems:

- It currently requires that the entire string (or file) be processed in memory. If I wanted to process a large file line by line, a tag which spans more than one line would be ignored. (That's assuming I could not store state information in the function, which is something I've not yet learned how to do. See the generator sketch at the end, though.)
- HTML comments may not be supported. (I'm not really sure about this; a comment containing a '>' would be cut off at that '>'.)
- It may be very slow, as it indexes into the string instead of iterating over it.

What can I do to overcome these issues? Am I reinventing the wheel? Should I be using re? (A rough re version is sketched at the end.)

thanks,
brian

********************************
# -*- coding: utf-8 -*-
# html2list.py

def split(alltext, charset='ñÑçÇáÁéÉíÍóÓúÚ'):
    # in = string; out = list of words, nonwords, html tags.
    '''Builds a list of the words, tags, and nonwords in a text.'''
    length = len(alltext)
    str2list = []
    url = []
    word = []
    nonword = []
    i = 0
    # classify the first character
    if alltext[i] == '<':
        url.append(alltext[i])
    elif alltext[i].isalpha() or alltext[i] in charset:
        word.append(alltext[i])
    else:
        nonword.append(alltext[i])
    i += 1
    while i < length:
        if url:
            if alltext[i] == '>':               # end url
                url.append(alltext[i])
                str2list.append("".join(url))
                url = []
                i += 1
                if i >= length:                 # the '>' was the last character
                    break
                if alltext[i] == '<':           # another tag starts immediately
                    url.append(alltext[i])
                elif alltext[i].isalpha() or alltext[i] in charset:  # start word
                    word.append(alltext[i])
                else:                           # start nonword
                    nonword.append(alltext[i])
            else:
                url.append(alltext[i])
        elif word:
            if alltext[i].isalpha() or alltext[i] in charset:        # continue word
                word.append(alltext[i])
            elif alltext[i] == '<':             # start url
                str2list.append("".join(word))
                word = []
                url.append(alltext[i])
            else:                               # start nonword
                str2list.append("".join(word))
                word = []
                nonword.append(alltext[i])
        elif nonword:
            if alltext[i].isalpha() or alltext[i] in charset:        # start word
                str2list.append("".join(nonword))
                nonword = []
                word.append(alltext[i])
            elif alltext[i] == '<':             # start url
                str2list.append("".join(nonword))
                nonword = []
                url.append(alltext[i])
            else:                               # continue nonword
                nonword.append(alltext[i])
        else:
            print 'error',
        i += 1
    # flush whatever is left at the end of the text
    if nonword:
        str2list.append("".join(nonword))
    if url:
        str2list.append("".join(url))
    if word:
        str2list.append("".join(word))
    return str2list

## example:
text = '''El aguardiente de caña le quemó la garganta y devolvió la botella con una mueca. No se me ponga feo, doctor. Esto mata los bichos de las tripas dijo Antonio José Bolívar, pero no pudo seguir hablando.'''
print split(text)
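
P.S. Since I asked about re, here is the kind of re-based version I have in mind. It is only a sketch: the pattern and the name split_re are my guesses at expressing the same rules (a tag, a run of letters, a run of anything else), and it assumes the text has already been decoded to a unicode string (e.g. read with codecs.open or decoded with .decode('utf-8')):

import re

# One alternative per token type, tried left to right.
# With re.UNICODE, \w covers accented letters such as ñ and á,
# so the explicit charset parameter is no longer needed.
TOKEN = re.compile(
    r"<[^>]*>"              # a tag: '<' up to the next '>'
    r"|[^\W\d_]+"           # a word: one or more letters (Unicode-aware)
    r"|(?:[^\w<]|[\d_])+"   # a nonword: a run that is neither letters nor '<'
    r"|<",                  # a stray '<' with no closing '>'
    re.UNICODE)

def split_re(alltext):
    '''Same output as split(), I think, for well-formed input.'''
    return TOKEN.findall(alltext)

If HTML comments turn out to matter, I suppose they could get their own alternative in front of the tag one (something like <!--.*?--> with re.DOTALL). And since findall does its looping in C, it should also be a good deal faster than indexing one character at a time in Python.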
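
And here is roughly what I imagine a line-by-line version would look like, using a generator to keep state between lines so that a tag split across a line boundary stays in one piece. Again just a sketch that reuses the TOKEN pattern above (split_lines and cuento.txt are made-up names), and one known difference from split() is that a run of punctuation/whitespace crossing a line break comes out as two tokens instead of one:

def split_lines(lines):
    '''Yield tokens from an iterable of unicode lines (e.g. an open file),
    without loading the whole text into memory.'''
    buf = ''
    for line in lines:
        buf += line
        # if the buffer ends inside an unfinished tag, hold that part
        # back until the closing '>' arrives on a later line
        cut = buf.rfind('<')
        if cut != -1 and '>' not in buf[cut:]:
            head, buf = buf[:cut], buf[cut:]
        else:
            head, buf = buf, ''
        for token in TOKEN.findall(head):
            yield token
    # whatever is left when the input runs out (note: a '<' that never
    # closes keeps the rest of the input in buf until this point)
    for token in TOKEN.findall(buf):
        yield token

## example (decoding to unicode so ñ, á, etc. count as letters):
# import codecs
# for token in split_lines(codecs.open('cuento.txt', encoding='utf-8')):
#     print token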
'''Para eso son los amigos. Para celebrar <i>las gracias</i> del otro.''' and the expected translation with all of the original tags, whitespace, etc intact: '''For that are the friends. For toCelebrate <i>the graces</i> ofThe other.<p>''' I was unable to find (in htmlparser, string or unicode) a way to define words as a series of letters (including non-ascii char sets) outside of an xml tag and whitespace/punctuation, so I wrote the code below to create a list of the words, nonwords, and xml tags in a text. My intuition tells me that its an awful lot of code to do a simple thing, but it's the best I could come up with. I forsee several problems: -it currently requires that the entire string (or file) be processed into memory. if i should want to process a large file line by line, a tab which spans more than one line would be ignored. (that's assuming i would not be able to store state information in the function, which is something i've not yet learned how to do) -html comments may not be supported. (i'm not really sure about this) -it may be very slow as it indexes instead of iterating over the string. what can i do to overcome these issues? Am I reinventing the wheel? Should I be using re? thanks, brian ******************************** # -*- coding: utf-8 -*- # html2list.py def split(alltext, charset='ñÑçÇáÁéÉíÍóÓúÚ'): #in= string; out= list of words, nonwords, html tags. '''builds a list of the words, tags, and nonwords in a text.''' length = len(alltext) str2list = [] url = [] word = [] nonword = [] i = 0 if alltext[i] == '<': url.append(alltext[i]) elif alltext[i].isalpha() or alltext[i] in charset: word.append(alltext[i]) else: nonword.append(alltext[i]) i += 1 while i < length: if url: if alltext[i] == '>': #end url: url.append(alltext[i]) str2list.append("".join(url)) url = [] i += 1 if alltext[i].isalpha() or alltext[i] in charset: #start word word.append(alltext[i]) else: #start nonword nonword.append(alltext[i]) else: url.append(alltext[i]) elif word: if alltext[i].isalpha() or alltext[i] in charset: #continue word word.append(alltext[i]) elif alltext[i] == '<': #start url str2list.append("".join(word)) word = [] url.append(alltext[i]) else: #start nonword str2list.append("".join(word)) word = [] nonword.append(alltext[i]) elif nonword: if alltext[i].isalpha() or alltext[i] in charset: #start word str2list.append("".join(nonword)) nonword = [] word.append(alltext[i]) elif alltext[i] == '<': #start url str2list.append("".join(nonword)) nonword = [] url.append(alltext[i]) else: #continue nonword nonword.append(alltext[i]) else: print 'error', i += 1 if nonword: str2list.append("".join(nonword)) if url: str2list.append("".join(url)) if word: str2list.append("".join(word)) return str2list ## example: text = '''El aguardiente de caña le quemó la garganta y devolvió la botella con una mueca. No se me ponga feo, doctor. Esto mata los bichos de las tripas dijo Antonio José Bolívar, pero no pudo seguir hablando.''' print split(text) ___________________________________________________________ Try the all-new Yahoo! Mail. "The New Version is radically easier to use" The Wall Street Journal http://uk.docs.yahoo.com/nowyoucan.html _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor