Re: [Tutor] HTML Parsing

2014-05-31 Thread Mitesh H. Budhabhatti
> I see others have answered the programming question, but there's a separate one. What is the license of the particular site, yahoo in this case. For an occasional scrape, nobody's likely to mind. But if you plan any volume, using the official API is more polite. Thanks for the rep

Re: [Tutor] HTML Parsing

2014-05-29 Thread Mitesh H. Budhabhatti
Alan Gauld thanks for the reply. I'll try that out. Warm Regards, *Mitesh H. Budhabhatti* Cell# +91 99040 83855 On Wed, May 28, 2014 at 11:19 PM, Danny Yoo wrote: > I am using Python 3.3.3 on Windows 7. I would like to know what is the best method to do HTML parsing? For example, I

Re: [Tutor] HTML Parsing

2014-05-28 Thread Dave Angel
"Mitesh H. Budhabhatti" Wrote in message: (please post in text email, not html. Doesn't matter for most people on this particular message, but it's the polite thing to do) I see others have answered the programming question, but there's a separate one. What is the license of the particul

Re: [Tutor] HTML Parsing

2014-05-28 Thread Danny Yoo
> I am using Python 3.3.3 on Windows 7. I would like to know what is the best > method to do HTML parsing? For example, I want to connect to www.yahoo.com > and get all the tags and their values. For this purpose, you may want to look at the APIs that the search engines provide, rather than try

Re: [Tutor] HTML Parsing

2014-05-28 Thread Alan Gauld
On 28/05/14 11:42, Mitesh H. Budhabhatti wrote: Hello Friends, I am using Python 3.3.3 on Windows 7. I would like to know what is the best method to do HTML parsing? For example, I want to connect to www.yahoo.com and get all the tags and their values. The standard lib
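
A minimal sketch of the standard-library route Alan is pointing at, for Python 3.3: urllib.request fetches the page and a small html.parser subclass collects every start tag with its attributes. The yahoo.com URL comes from Mitesh's question; error handling and character-set detection are skipped.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TagCollector(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.tags = []                        # (tag name, attribute dict) pairs

        def handle_starttag(self, tag, attrs):    # called once per opening tag
            self.tags.append((tag, dict(attrs)))

    page = urlopen("http://www.yahoo.com").read().decode("utf-8", "replace")
    collector = TagCollector()
    collector.feed(page)
    for tag, attrs in collector.tags:
        print(tag, attrs)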

[Tutor] HTML Parsing

2014-05-28 Thread Mitesh H. Budhabhatti
Hello Friends, I am using Python 3.3.3 on Windows 7. I would like to know what is the best method to do HTML parsing? For example, I want to connect to www.yahoo.com and get all the tags and their values. Thanks. Warm Regards, *Mitesh H. Budhabhatti* Cell# +91 99040 83855

Re: [Tutor] HTML Parsing

2008-04-22 Thread Kent Johnson
Stephen Nelson-Smith wrote: Comments and criticism please. The SGML parser seems like overkill. Can't you just apply the regexes directly to status_info? If the format may change, or you need to interpret entities, etc then the parser is helpful. In this case I don't see how it is needed.
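
Roughly what Kent means, in Python 2 to match the rest of the thread: apply the regexes straight to status_info. The two patterns are only guesses based on the strings Stephen's own script searches for ("requests/sec" and "requests currently being processed"), not confirmed mod_status output.

    import re, urllib

    status_info = urllib.urlopen("http://10.1.2.201/server-status").read()

    m = re.search(r"([\d.]+) requests/sec", status_info)
    if m:
        print "requests/sec:", m.group(1)

    m = re.search(r"(\d+) requests currently being processed", status_info)
    if m:
        print "connections:", m.group(1)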

Re: [Tutor] HTML Parsing

2008-04-22 Thread Stephen Nelson-Smith
Hello, > For data this predictable, simple regex matching will probably work fine. I thought that too... Anyway - here's what I've come up with: #!/usr/bin/python import urllib, sgmllib, re mod_status = urllib.urlopen("http://10.1.2.201/server-status") status_info = mod_status.read() mod_st

Re: [Tutor] HTML Parsing

2008-04-21 Thread Kent Johnson
Stephen Nelson-Smith wrote: Hi, I want to write a little script that parses an Apache mod_status page. I want it to return simply the number of page requests a second and the number of connections. The page looks like this: Apache Status Apache Server Status for 10.1.2.201 Server Versio

Re: [Tutor] HTML Parsing

2008-04-21 Thread Andreas Kostyrka
Eeck. Not that I advocate parsing files by line, but if you need to do it: lines = list(file)[16:] or: lines_iter = iter(file) zip(lines_iter, xrange(16)) for line in lines_iter: Andreas On Monday, 21.04.2008 at 14:42 +, linuxian iandsd wrote: > Another horrid solution > > #!
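
Andreas's two variants written out (Python 2). itertools.islice does the same job as the zip/xrange trick, and stops after exactly 16 lines rather than pulling a 17th one from the iterator.

    import itertools, urllib2

    # Variant 1: slurp everything and slice off the first 16 lines.
    f = urllib2.urlopen("http://10.1.2.201/server-status")
    lines = list(f)[16:]

    # Variant 2: advance an iterator past the first 16 lines, then keep reading.
    f = urllib2.urlopen("http://10.1.2.201/server-status")
    lines_iter = iter(f)
    for skipped in itertools.islice(lines_iter, 16):
        pass
    for line in lines_iter:
        print line,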

Re: [Tutor] HTML Parsing

2008-04-21 Thread Andreas Kostyrka
If you have a correct XML document. In practice this is rather a big IF. Andreas On Monday, 21.04.2008 at 10:35 -0700, Jeff Younker wrote: > On Apr 21, 2008, at 6:40 AM, Stephen Nelson-Smith wrote: > > > On 4/21/08, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > > I want to stick with standard

Re: [Tutor] HTML Parsing

2008-04-21 Thread Jeff Younker
On Apr 21, 2008, at 6:40 AM, Stephen Nelson-Smith wrote: > On 4/21/08, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > I want to stick with standard library. > > How do you capture elements? from xml.etree import ElementTree document = """ foo and bar foo
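
The archive has stripped the markup out of Jeff's sample document; a reconstruction in the same spirit, with an invented <dl> snippet, might look like the following. As Andreas notes above, ElementTree only helps if the page is well-formed XML.

    from xml.etree import ElementTree

    # Invented stand-in for the real page; mod_status wraps its summary
    # lines in <dt> elements.
    document = """<html><body><dl>
      <dt>foo and bar</dt>
      <dt>foo</dt>
    </dl></body></html>"""

    root = ElementTree.fromstring(document)
    for dt in root.iter("dt"):        # .getiterator("dt") on 2008-era Pythons
        print(dt.text)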

Re: [Tutor] HTML Parsing

2008-04-21 Thread Stephen Nelson-Smith
Hi, > for lineno, line in enumerate(html): python 2.2 has no enumerate(). Can we code around this? > x = line.find("requests/sec") > if x >= 0: > no_requests_sec = line[3:x] > break > for lineno, line in enumerate(html[lineno+1:]): > x = line.find("requests currently being processed"

Re: [Tutor] HTML Parsing

2008-04-21 Thread bob gailer
Stephen Nelson-Smith wrote: > Hi, > >> for lineno, line in enumerate(html): > > python 2.2 has no enumerate() > > I used enumerate for a marginal (unproven) performance enhancement. > Can we code around this? for lineno in range(len(html)): x = html[lineno].find("requests/sec") i
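
Bob's enumerate-free version, sketched out in full for Python 2.2. Here html is assumed to be the page already split into a list of lines, and the [3:x] slice simply copies Stephen's original.

    no_requests_sec = None
    for lineno in range(len(html)):
        x = html[lineno].find("requests/sec")
        if x >= 0:
            no_requests_sec = html[lineno][3:x]
            break

    # Resume scanning from the next line for the connection count.
    for lineno in range(lineno + 1, len(html)):
        if html[lineno].find("requests currently being processed") >= 0:
            print html[lineno]
            break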

Re: [Tutor] HTML Parsing

2008-04-21 Thread linuxian iandsd
Another horrid solution > #!/usr/bin/python > # line number does not change so we use that > # the data we're looking for does not have a (unique) close tag (htmllib) > > import re, urllib2 > file = urllib2.urlopen('http://10.1.2.201/server-status') > n = 0 > for line in file: > n = n + 1 > if n=
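
The rest of that loop presumably just compares n against hard-coded line numbers; a sketch of the style, where the numbers 16 and 17 are placeholders rather than anything taken from the truncated message:

    import re, urllib2

    f = urllib2.urlopen('http://10.1.2.201/server-status')
    n = 0
    for line in f:
        n = n + 1
        if n == 16:                                  # line carrying the requests/sec figure
            print re.sub('<[^>]+>', '', line).strip()
        elif n == 17:                                # line carrying the connection counts
            print re.sub('<[^>]+>', '', line).strip()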

Re: [Tutor] HTML Parsing

2008-04-21 Thread bob gailer
Stephen Nelson-Smith wrote: > Hi, > > I want to write a little script that parses an Apache mod_status page. > > I want it to return simply the number of page requests a second and > the number of connections. > > It seems this is very complicated... I can do it in a shell one-liner: > > curl 10.1.

Re: [Tutor] HTML Parsing

2008-04-21 Thread Andreas Kostyrka
Just from memory, you need to subclass the HTMLParser class and provide start_dt and end_dt methods, plus one to capture the text in between. Read the docs on htmllib (www.python.org | Documentation | module docs), and see if you can manage; if not, come back with questions ;) Andreas On Monday,
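
A rough, untested sketch of what Andreas describes. He points at htmllib, but sgmllib (which Stephen's script already imports, and which htmllib is built on) offers the same start_dt/end_dt hooks with less setup. It assumes the page really closes its </dt> tags; Python 2 only, since both modules were removed in Python 3.

    import sgmllib, urllib

    class DtParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.in_dt = False
            self.dts = []

        def start_dt(self, attrs):       # called at each <dt>
            self.in_dt = True
            self.dts.append("")

        def end_dt(self):                # called at each </dt>
            self.in_dt = False

        def handle_data(self, data):     # text between tags
            if self.in_dt:
                self.dts[-1] = self.dts[-1] + data

    p = DtParser()
    p.feed(urllib.urlopen("http://10.1.2.201/server-status").read())
    p.close()
    for text in p.dts:
        print text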

Re: [Tutor] HTML Parsing

2008-04-21 Thread Stephen Nelson-Smith
On 4/21/08, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > As usual there are a number of ways. > > But I basically see two steps here: > > 1.) capture all dt elements. If you want to stick with the standard library, htmllib would be the module. Else you can use e.g. BeautifulSoup or somethi

Re: [Tutor] HTML Parsing

2008-04-21 Thread Andreas Kostyrka
As usual there are a number of ways. But I basically see two steps here: 1.) capture all dt elements. If you want to stick with the standard library, htmllib would be the module. Else you can use e.g. BeautifulSoup or something comparable. 2.) Check all dt contents either via regex, or with a .s
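
The BeautifulSoup branch of step 1, with a regex for step 2, might look roughly like this. The modern bs4 API is shown, which is newer than the module Andreas had in mind in 2008, and the pattern is only a guess at the mod_status wording.

    import re
    from bs4 import BeautifulSoup

    html = open("server-status.html").read()      # or the urllib fetch shown earlier
    soup = BeautifulSoup(html, "html.parser")

    for dt in soup.find_all("dt"):                # step 1: every <dt> element
        text = dt.get_text()
        m = re.search(r"([\d.]+) requests/sec", text)   # step 2: filter the contents
        if m:
            print(m.group(1))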

[Tutor] HTML Parsing

2008-04-21 Thread Stephen Nelson-Smith
Hi, I want to write a little script that parses an Apache mod_status page. I want it to return simply the number of page requests a second and the number of connections. It seems this is very complicated... I can do it in a shell one-liner: curl 10.1.2.201/server-status 2>&1 | grep -i request |