> > > I also found that on some of the strings I want to extract, when
> > > python reads them using file.read(), there are newline characters
> > > and other stuff that doesn't show up in the actual html source.
> >
> > Not certain that I understand what you mean there.  Can you show us?
> > read() should not adulterate the byte stream that comes out of your
> > files.
>
> >>> file = open("file1.html")
> >>> file.read()
> '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
> Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n<html>\r\n<head>\r\n<!--
> Script for select box changes -->\r\n<script type="text/javascript">\r\n
> [...]
>
> That's just a snippet from the html code. I'm guessing it won't cause any
> problems since it's just the newlines from reading the HTML code and not
> actually *in* the code.

Hi Oswaldo,

Those newlines ARE in the file.  *grin*

Just to clarify: the convention that text files use to break things into
lines is to end each line with a newline character.  Actually, on
DOS-based systems it's "\r\n", that is, a carriage-return character
followed by a newline character.

What's happening is that the interactive interpreter displays the string
with repr(), which makes newlines and other control characters visible.
The characters really are in the string; repr() just writes them as short
backslash codes such as '\n', '\r' and '\t' so that we can see them.

We can find a list of these codes here, under the "Escape Sequence"
table section:

    http://www.python.org/doc/ref/strings.html

Those characters act as hints that many programs use to trigger special
behavior.  In particular, most programs that display text will drop down
to the start of the next line when they see a carriage return ('\r')
followed by a newline ('\n'), or a newline on its own.

> Yes I'm seeing this right now hehe....but since all the files I have to
> process have the same structure (they were generated by a script) I
> think it might be easier to use RE's here. Do you have any idea of what
> other tool I can use? I took a look at BeautifulSoup but it seemed a bit
> overkill and very much over my current python knowledge.

When you have time, do try going through a few examples with
BeautifulSoup.  Its web page comes with some interesting examples, and I
don't think it's as bad as you might think.  *grin*

It's not overkill: BeautifulSoup is explicitly designed to do the kind of
data extraction that you're doing right now.  If you have questions about
it, please feel free to ask.

Best of wishes!
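
P.S.  If it helps, here is a small sketch you can run to see the repr()
point for yourself.  It assumes Python 3, and the sample bytes are made up
to resemble the first lines of file1.html; the temporary file is only there
so the example stands on its own.  It reads in binary mode because Python
3's default text mode would quietly translate "\r\n" into "\n" on the way
in.

```python
import os
import tempfile

# Hypothetical sample standing in for the first lines of file1.html.
sample = b'<html>\r\n<head>\r\n<!-- Script for select box changes -->\r\n'

# Write the bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Binary mode returns the bytes exactly as they are stored on disk.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

text = data.decode("ascii")

# repr() spells control characters out as backslash escapes...
shown = repr(text)
print(shown)

# ...but each escape stands for one real character in the string.
cr_count = text.count("\r")
nl_count = text.count("\n")
print(cr_count, nl_count)

# print() sends the characters themselves, so the terminal breaks the lines.
print(text)

# splitlines() treats "\r\n" as a single line ending and drops it.
lines = text.splitlines()
print(lines)
```

So the "\r\n" pairs are part of the data, and anything you write to pull
strings out of the file (regular expressions included) has to allow for
them.  splitlines() or strip() is usually enough to get rid of them.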