Re: [Tutor] Extracting data from HTML files

Danny Yoo Fri, 30 Dec 2005 20:04:41 -0800

> > > I also found that on some of the strings I want to extract, when
> > > python reads them using file.read(), there are newline characters
> > > and other stuff that doesn't show up in the actual html source.
> >
> > Not certain that I understand what you mean there.  Can you show us?
> > read() should not adulterate the byte stream that comes out of your
> > files.
>
> >>> file = open("file1.html")
> >>> file.read()
> '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML
> 4.01
> Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n<html>\r\n<head>\r\n<!--
> Script for select box changes -->\r\n<script type="text/javascript">\r\n
> [...]
>
> That's just a snippet from the html code. I'm guessing it won't cause any
> problems since it's just the newlines from reading the HTML code and not
> actually *in* the code.


Hi Oswaldo,

Those newlines ARE in the file.  *grin*

Just to clarify: the convention that text files use to break things into
lines is to end each line with a newline character ('\n').  Actually, on
DOS-based systems, it's "\r\n", that is, a carriage-return character
followed by a newline character.

What's happening is that when you evaluate an expression at the interactive
prompt, Python displays the result using repr().  repr() makes newlines and
other control characters visible by writing them as backslash escapes such
as '\n', '\r', and '\t'; control characters without a short name are shown
in hex form, like '\x0c'.
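
Here's a small sketch of the difference between what repr() displays and
what the string really contains.  The sample string is made up to resemble
the start of your file; it's not your actual data.

```python
# A short string with DOS-style line endings, standing in for file contents.
text = '<html>\r\n<head>\r\n'

# repr() spells out the control characters as backslash escapes...
shown = repr(text)
print(shown)

# ...but the string itself contains no backslash characters at all.
# Each '\r' and each '\n' is a single character.
has_backslash = '\\' in text
print(has_backslash)
print(len('\r\n'))

# print() sends the characters through as they are, so the line breaks
# take effect instead of being displayed as escapes.
print(text)
```

So the backslashes belong to the display, while the carriage returns and
newlines belong to the data.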

We can find a list of the special control characters here, under the
"Escape Sequence" table section:

    http://www.python.org/doc/ref/strings.html


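Those characters act as hints that many programs use to trigger special
behavior.  In particular, most programs that see the newline ('\n') and
carriage return ('\r') characters will drop down to the next line instead
of displaying them, which is why you don't see them when you view the HTML
source in an editor.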
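
If the "\r\n" pairs get in the way when you pull text out, here is one
possible way to deal with them.  This is only a sketch: it assumes a current
Python 3, where open() takes a newline argument, and it writes a made-up
temporary file rather than reading your file1.html.

```python
import os
import tempfile

# Hypothetical sample bytes with DOS-style line endings.
raw = b'<html>\r\n<head>\r\n<title>Test</title>\r\n'

# Write the sample to a temporary file so there is something to read back.
with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline='' turns off newline translation: '\r\n' comes back untouched.
    with open(path, newline='') as f:
        untranslated = f.read()

    # The default text mode translates '\r\n' into a plain '\n'.
    with open(path) as f:
        translated = f.read()
finally:
    os.remove(path)

# splitlines() breaks on '\n', '\r\n', and '\r' alike, and drops the
# line endings from the pieces.
lines = untranslated.splitlines()
print(lines)
```

Either way, nothing is lost: the line endings are ordinary characters that
you can keep, translate, or split on as you see fit.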



> Yes I'm seeing this right now hehe....but since all the files I have to
> process have the same structure (they were generated by a script) I
> think it might be easier to use RE's here. Do you have any idea of what
> other tool I can use? I took a look at BeautifulSoup but it seemed a bit
> overkill and very much over my current python knowledge.

When you have time, do try going through a few examples with
BeautifulSoup.  The web page there comes with some interesting examples,
and I don't think it's as bad as you might think.  *grin* It's not
overkill: BeautifulSoup is explicitly designed to do the kind of data
extraction that you're doing right now.

If you have questions about it, please feel free to ask.  Best of wishes!

