[issue39833] Bug in html parsing module triggered by malformed input

2020-03-02 Thread Ezio Melotti
Ezio Melotti added the comment: Thanks for the report. This is a duplicate of #34480. -- nosy: +ezio.melotti; resolution: -> duplicate; stage: -> resolved; status: open -> closed; type: compile error -> behavior

[issue39833] Bug in html parsing module triggered by malformed input

2020-03-02 Thread Evan
if not match: return -1 if report: j = match.start(0) self.unknown_decl(rawdata[i+3: j]) return match.end(0) `match` should be set to None in the fall-through else statement right before `if not match`. -- components: Library (Lib) messa
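The pattern Evan describes can be reproduced outside the standard library. What follows is a minimal sketch, not the actual standard-library source: `find_section_end`, its `fixed` flag, and the two regular expressions are hypothetical stand-ins chosen for illustration. It shows why a branch that never assigns `match` makes the later `if not match` raise `UnboundLocalError`, and how assigning `None` in the fall-through branch avoids that.

```python
import re

# Hypothetical, simplified stand-ins for the parser's closing patterns.
_close_std = re.compile(r"\]\s*\]\s*>")   # matches "]]>"
_close_ms = re.compile(r"\]\s*>")         # matches "]>"


def find_section_end(rawdata, keyword, start, fixed=True):
    """Return the index just past the end of a marked section, or -1."""
    if keyword in {"temp", "cdata", "ignore", "include", "rcdata"}:
        match = _close_std.search(rawdata, start)
    elif keyword in {"if", "else", "endif"}:
        match = _close_ms.search(rawdata, start)
    else:
        if fixed:
            # The suggested fix: give `match` a value on the fall-through path.
            match = None
        # With fixed=False nothing is assigned here, so `match` is unbound
        # when it is read below and Python raises UnboundLocalError.
    if not match:
        return -1
    return match.end(0)
```

With `fixed=True`, an unknown keyword simply yields `-1`; with `fixed=False`, the same input raises `UnboundLocalError`, which is the kind of crash the report describes for malformed input.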

Html Parsing stuff

2014-07-21 Thread Nicholas Cannon
OK, I get the basics of this and I have been doing some successful parsing, using regular expressions to find HTML tags. I have tried to find an img tag and write that image to a file. I have had no success. It says it has successfully written the image to the file with a try... except

Re: Html Parsing stuff

2014-07-21 Thread Nicholas Cannon
Don't worry, it has been solved. -- https://mail.python.org/mailman/listinfo/python-list

Beautifulsoup html parsing - nested tags

2011-01-05 Thread Selvam
Hi all, I am trying to parse some HTML string with BeautifulSoup. The string is, <table colWidths='530.0' style='Table_Main_Table'> <tr> <td> <blockTable colWidths='54.0,80.0,67.0' style='Table_Tax_Header'> <tr> <th> <p

Re: Beautifulsoup html parsing - nested tags

2011-01-05 Thread Selvam
On Wed, Jan 5, 2011 at 2:58 PM, Selvam s.selvams...@gmail.com wrote: Hi all, I am trying to parse some HTML string with BeautifulSoup. The string is, <table colWidths='530.0' style='Table_Main_Table'> <tr> <td> <blockTable colWidths='54.0,80.0,67.0'

Re: HTML Parsing

2008-06-30 Thread Larry Bates
[EMAIL PROTECTED] wrote: Hi everyone, I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python too? That would be great. Check on Mechanize. It

Re: HTML Parsing

2008-06-29 Thread Sebastian lunar Wiesner
Stefan Behnel [EMAIL PROTECTED]: [EMAIL PROTECTED] wrote: I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python too? That would be great.

HTML Parsing

2008-06-28 Thread disappearedng
Hi everyone, I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python too? That would be great.

Re: HTML Parsing

2008-06-28 Thread Dan Stromberg
On Sat, 28 Jun 2008 19:03:39 -0700, disappearedng wrote: Hi everyone, I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python too? That would be

Re: HTML Parsing

2008-06-28 Thread Benjamin
On Jun 28, 9:03 pm, [EMAIL PROTECTED] wrote: Hi everyone, I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Look at the httplib module. Also, are there any open-source parsing engines for HTML documents available in Python

Re: HTML Parsing

2008-06-28 Thread Victor Noagbodji
Hi everyone Hello I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. urllib2: http://docs.python.org/lib/module-urllib2.html Also, are there any open-source parsing engines for HTML documents available in Python too? That would

Re: HTML Parsing

2008-06-28 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python too? That would be great. Try lxml.html. It parses broken HTML,

Re: HTML parsing confusion

2008-01-23 Thread M.-A. Lemburg
a specific problem which I attempted to lay out clearly from the outset. I was asking this community if there was a simple way to use only the tools included with Python to parse a bit of HTML. There are lots of ways of doing HTML parsing in Python. A common one is e.g. using mxTidy to convert

Re: HTML parsing confusion

2008-01-23 Thread cokofreedom
The pages I'm trying to write this code to run against aren't in the wild, though. They are static html files on my company's lan, are very consistent in format, and are (I believe) valid html. Obvious way to check this is to go to http://validator.w3.org/ and see what it tells you about your

Re: HTML parsing confusion

2008-01-23 Thread Alnilam
On Jan 23, 3:54 am, M.-A. Lemburg [EMAIL PROTECTED] wrote: I was asking this community if there was a simple way to use only the tools included with Python to parse a bit of HTML. There are lots of ways of doing HTML parsing in Python. A common one is e.g. using mxTidy to convert the HTML

Re: HTML parsing confusion

2008-01-23 Thread Jerry Hill
On Jan 23, 2008 7:40 AM, Alnilam [EMAIL PROTECTED] wrote: Skipping past html validation, and html to xhtml 'cleaning', and instead starting with the assumption that I have files that are valid XHTML, can anyone give me a good example of how I would use _ htmllib, HTMLParser, or ElementTree _

Re: HTML parsing confusion

2008-01-23 Thread Gabriel Genellina
On Wed, 23 Jan 2008 10:40:14 -0200, Alnilam [EMAIL PROTECTED] wrote: Skipping past html validation, and html to xhtml 'cleaning', and instead starting with the assumption that I have files that are valid XHTML, can anyone give me a good example of how I would use _ htmllib, HTMLParser, or

Re: HTML parsing confusion

2008-01-22 Thread John Machin
On Jan 22, 4:31 pm, Alnilam [EMAIL PROTECTED] wrote: Sorry for the noob question, but I've gone through the documentation on python.org, tried some of the diveintopython and boddie's examples, and looked through some of the numerous posts in this group on the subject and I'm still rather

Re: HTML parsing confusion

2008-01-22 Thread Paul Boddie
On 22 Jan, 06:31, Alnilam [EMAIL PROTECTED] wrote: Sorry for the noob question, but I've gone through the documentation on python.org, tried some of the diveintopython and boddie's examples, and looked through some of the numerous posts in this group on the subject and I'm still rather

Re: HTML parsing confusion

2008-01-22 Thread Alnilam
Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous 200-modules PyXML package installed. And you don't want the 75Kb BeautifulSoup? I wasn't aware that I had PyXML installed, and can't find a reference to

Re: HTML parsing confusion

2008-01-22 Thread Paul McGuire
On Jan 22, 7:44 am, Alnilam [EMAIL PROTECTED] wrote: ...I move from computer to computer regularly, and while all have a recent copy of Python, each has different (or no) extra modules, and I don't always have the luxury of downloading extras. That being said, if there's a simple way of doing

Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 8:44 am, Alnilam [EMAIL PROTECTED] wrote: Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous 200-modules PyXML package installed. And you don't want the 75Kb BeautifulSoup? I wasn't aware

Re: HTML parsing confusion

2008-01-22 Thread Diez B. Roggisch
Alnilam wrote: On Jan 22, 8:44 am, Alnilam [EMAIL PROTECTED] wrote: Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous 200-modules PyXML package installed. And you don't want the 75Kb BeautifulSoup? I

Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 11:39 am, Diez B. Roggisch [EMAIL PROTECTED] wrote: Alnilam wrote: On Jan 22, 8:44 am, Alnilam [EMAIL PROTECTED] wrote: Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous 200-modules PyXML

Re: HTML parsing confusion

2008-01-22 Thread Gabriel Genellina
On Tue, 22 Jan 2008 19:20:32 -0200, Alnilam [EMAIL PROTECTED] wrote: On Jan 22, 11:39 am, Diez B. Roggisch [EMAIL PROTECTED] wrote: Alnilam wrote: On Jan 22, 8:44 am, Alnilam [EMAIL PROTECTED] wrote: Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have

Re: HTML parsing confusion

2008-01-22 Thread [EMAIL PROTECTED]
On Jan 22, 7:29 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: I was asking this community if there was a simple way to use only the tools included with Python to parse a bit of html. If you *know* that your document is valid HTML, you can use the HTMLParser module in the standard Python
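As a rough illustration of the standard-library route mentioned here, the sketch below uses `html.parser.HTMLParser` (the Python 3 name of the old `HTMLParser` module) to collect the text of every `<p>` element. The class name `ParagraphText` and the choice of `<p>` as the target tag are assumptions made for the example; the thread does not say which elements were wanted.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._depth = 0        # greater than 0 while inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> still arrives here.
        if self._depth:
            self._chunks.append(data)


def paragraphs_of(html_text):
    parser = ParagraphText()
    parser.feed(html_text)
    parser.close()   # flush any buffered text
    return parser.paragraphs
```

This only behaves predictably on reasonably well-formed input, which matches the "if you *know* that your document is valid HTML" caveat in the message.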

Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 7:29 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: I was asking this community if there was a simple way to use only the tools included with Python to parse a bit of html. If you *know* that your document is valid HTML, you can use the HTMLParser module in the standard Python

HTML parsing confusion

2008-01-21 Thread Alnilam
Sorry for the noob question, but I've gone through the documentation on python.org, tried some of the diveintopython and boddie's examples, and looked through some of the numerous posts in this group on the subject and I'm still rather confused. I know that there are some great tools out there for

Re: How to Encode Parameters into an HTML Parsing Script

2007-06-22 Thread SMERSH009X
On Jun 21, 9:45 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: On Thu, 21 Jun 2007 23:37:07 -0300, [EMAIL PROTECTED] wrote: So for example if I wanted to navigate to an encoded url http://online.investools.com/landing.iedu?signedin=true rather than

Re: How to Encode Parameters into an HTML Parsing Script

2007-06-21 Thread Gabriel Genellina
On Thu, 21 Jun 2007 23:37:07 -0300, [EMAIL PROTECTED] wrote: So for example if I wanted to navigate to an encoded url http://online.investools.com/landing.iedu?signedin=true rather than just http://online.investools.com/landing.iedu How would I do this? How can I modify the script to
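One way to build the parameterised URL from the question, sketched with Python 3's `urllib.parse`. The helper name `with_params` is hypothetical, and this is not necessarily the approach the reply went on to suggest, since the archived snippet is cut off.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_params(url, params):
    """Return `url` with `params` (a dict) appended to its query string."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    extra = urlencode(params)   # escapes keys and values as needed
    query = f"{query}&{extra}" if query else extra
    return urlunsplit((scheme, netloc, path, query, fragment))


target = with_params("http://online.investools.com/landing.iedu",
                     {"signedin": "true"})
```

Keeping the base URLs in the list and the parameters in a separate dict means the encoding happens in one place instead of being pasted into each URL by hand.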

How to Encode Parameters into an HTML Parsing Script

2007-06-21 Thread SMERSH009X
I've written a script that navigates various URLs on a website and fetches the contents. The URLs are being fed from a list urlList. Everything seems to work splendidly, until I introduce the concept of encoding parameters for a certain URL. So for example if I wanted to navigate to an encoded

Re: Output of HTML parsing

2007-06-19 Thread Jackie
On 15 June, 2:01, Stefan Behnel [EMAIL PROTECTED] wrote: Jackie wrote: import lxml.etree as et url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/" tree = et.parse(url) Stefan Thank you. But when I tried to run the above part, the following

Re: Output of HTML parsing

2007-06-19 Thread Stefan Behnel
Jackie wrote: On 15 June, 2:01, Stefan Behnel [EMAIL PROTECTED] wrote: Jackie wrote: import lxml.etree as et url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/" tree = et.parse(url) Stefan Thank you. But when I tried to run the above

Output of html parsing

2007-06-16 Thread Jackie Wang
Hi, all, I want to get the information of the professors (name, title) from the following link: "http://www.economics.utoronto.ca/index.php/index/person/faculty/" Ideally, I'd like to have an output file where each line is one Prof, including his name and title. In practice, I use

Output of HTML parsing

2007-06-15 Thread Jackie
Hi, all, I want to get the information of the professors (name, title) from the following link: "http://www.economics.utoronto.ca/index.php/index/person/faculty/" Ideally, I'd like to have an output file where each line is one Prof, including his name and title. In practice, I use the CSV module.

Re: Output of HTML parsing

2007-06-15 Thread Sebastian Wiesner
[ Jackie [EMAIL PROTECTED] ] 1. The code above assumes that each Prof has a title. If any one of them does not, the name and title will be mismatched. How to program to allow that title can be empty? 2. Is there any easier way to get the data I want other than using list? Use BeautifulSoup.
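The mismatch in question 1 usually comes from collecting names and titles into two separate lists. A sketch of the alternative is below: read both fields from the same per-person element, so a missing title becomes an empty string. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, and the markup in `SAMPLE` (a `div class="person"` wrapping `span` elements) is invented for the example; the real page's structure is not shown in the thread.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the second person has no title.
SAMPLE = """<div>
  <div class="person"><span class="name">A. Example</span><span class="title">Professor</span></div>
  <div class="person"><span class="name">B. Sample</span></div>
</div>"""


def _text(element):
    """Text of an element, or '' if the element is missing or empty."""
    return (element.text or "").strip() if element is not None else ""


def people_rows(xhtml):
    """Return one (name, title) tuple per person element."""
    root = ET.fromstring(xhtml)
    rows = []
    for person in root.iter("div"):
        if person.get("class") != "person":
            continue
        name = person.find("span[@class='name']")
        title = person.find("span[@class='title']")
        rows.append((_text(name), _text(title)))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(("name", "title"))
    writer.writerows(rows)
    return buffer.getvalue()
```

Because each row is built from one container element, a person without a title cannot shift the titles of everyone after them.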

Re: Output of HTML parsing

2007-06-15 Thread Stefan Behnel
Jackie wrote: I want to get the information of the professors (name, title) from the following link: "http://www.economics.utoronto.ca/index.php/index/person/faculty/" That's even XHTML, no need to go through BeautifulSoup. Use lxml instead. http://codespeak.net/lxml Ideally, I'd like to

Re: HTML Parsing

2007-02-25 Thread Stefan Behnel
John Machin wrote: One can even use ElementTree, if the HTML is well-formed. See below. However if it is as ill-formed as the sample (4th td element not closed; I've omitted it below), then the OP would be better off sticking with Beautiful Soup :-) Or (as we were talking about the best of

Re: HTML Parsing

2007-02-11 Thread John Machin
On Feb 11, 6:05 pm, Ayaz Ahmed Khan [EMAIL PROTECTED] wrote: mtuller typed: I have also tried Beautiful Soup, but had trouble understanding the documentation As Gabriel has suggested, spend a little more time going through the documentation of BeautifulSoup. It is pretty easy to grasp.

Re: HTML Parsing

2007-02-11 Thread Fredrik Lundh
John Machin wrote: One can even use ElementTree, if the HTML is well-formed. See below. However if it is as ill-formed as the sample (4th td element not closed; I've omitted it below), then the OP would be better off sticking with Beautiful Soup :-) or get the best of both worlds:

HTML Parsing

2007-02-10 Thread mtuller
Alright. I have tried everything I can find, but am not getting anywhere. I have a web page that has data like this: <tr> <td headers="col1_1" style="width:21%"> <span class="hpPageText">LETTER</span></td> <td headers="col2_1" style="width:13%; text-align:right"> <span class="hpPageText">33,699</span></td> <td

Re: HTML Parsing

2007-02-10 Thread Gabriel Genellina
On Sat, 10 Feb 2007 20:07:43 -0300, mtuller [EMAIL PROTECTED] wrote: <tr> <td headers="col1_1" style="width:21%"> <span class="hpPageText">LETTER</span></td> <td headers="col2_1" style="width:13%; text-align:right"> <span class="hpPageText">33,699</span></td> <td headers="col3_1" style="width:13%;

Re: HTML Parsing

2007-02-10 Thread Ayaz Ahmed Khan
mtuller typed: I have also tried Beautiful Soup, but had trouble understanding the documentation As Gabriel has suggested, spend a little more time going through the documentation of BeautifulSoup. It is pretty easy to grasp. I'll give you an example: I want to extract the text between the

Re: HTML Parsing and Indexing

2006-11-16 Thread Paul McGuire
On Nov 13, 1:12 pm, [EMAIL PROTECTED] wrote: I need a help on HTML parser. snip I saw a couple of python parsers like pyparsing, yappy, yapps, etc but they haven't given any example for HTML parsing. Geez, how hard did you look? pyparsing's wiki menu includes an 'Examples' link, which take

HTML Parsing and Indexing

2006-11-13 Thread mailtogops
example for HTML parsing. Someone recommended using lynx to convert the page into text and parse the data. That also looks good, but I still end up writing a huge chunk of code for each web page. What we need is one nice parser which should work on HTML/text file (lynx output) and work based

Re: HTML Parsing and Indexing

2006-11-13 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote: I need a help on HTML parser. http://www.effbot.org/pyfaq/tutor-how-do-i-get-data-out-of-html.htm /F

Re: HTML Parsing and Indexing

2006-11-13 Thread Bernard
. But Crawler, Parser and Indexer need to run unattended. I don't know how to proceed next.. I saw a couple of python parsers like pyparsing, yappy, yapps, etc but they haven't given any example for HTML parsing. Someone recommended using lynx to convert the page into text and parse the data

Re: HTML Parsing and Indexing

2006-11-13 Thread Andy Dingley
[EMAIL PROTECTED] wrote: I am involved in one project which tends to collect news information published on selected, known web sites in the format of HTML, RSS, etc I just can't imagine why anyone would still want to do this. With RSS, it's an easy (if not trivial) problem. With HTML

Re: HTML Parsing and Indexing

2006-11-13 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: I am involved in one project which tends to collect news information published on selected, known web sites in the format of HTML, RSS, etc and sortlist them and create a bookmark on our website for the news content (we will use django for web development). Currently

Re: HTML parsing bug?

2006-02-02 Thread Istvan Albert
this is a comment in JavaScript, which is itself inside an HTML comment Did you read the post? misread it rather ...

Re: HTML parsing bug?

2006-02-01 Thread Tim Roberts
Istvan Albert [EMAIL PROTECTED] wrote: this is a comment in JavaScript, which is itself inside an HTML comment Don't nest HTML comments. Occasionally it may break the browsers as well. Did you read the post? He didn't nest HTML comments. He put a JavaScript comment inside an HTML comment,

Re: HTML parsing bug?

2006-02-01 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote: Python 2.3.5 seems to choke when trying to parse html files, because it doesn't realize that what's inside <!-- --> is a comment in HTML, even if this comment is inside <script> </script>, especially if it's a comment inside that script code too. nope. what's inside

HTML parsing bug?

2006-01-30 Thread g_no_mail_please
Python 2.3.5 seems to choke when trying to parse html files, because it doesn't realize that what's inside <!-- --> is a comment in HTML, even if this comment is inside <script> </script>, especially if it's a comment inside that script code too. The html file: <!DOCTYPE HTML PUBLIC "-//W3C//DTD

Re: HTML parsing bug?

2006-01-30 Thread G.
// /ht ml - this is a comment in JavaScript, which is itself inside an HTML comment This is supposed to be one line. Got wrapped during posting.

Re: HTML parsing bug?

2006-01-30 Thread Richard Brodie
[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Python 2.3.5 seems to choke when trying to parse html files, because it doesn't realize that what's inside <!-- --> is a comment in HTML, even if this comment is inside <script> </script>, especially if it's a comment inside that

Re: HTML parsing bug?

2006-01-30 Thread Istvan Albert
this is a comment in JavaScript, which is itself inside an HTML comment Don't nest HTML comments. Occasionally it may break the browsers as well. (I remember this from one of the weirdest of bughunts: whenever the number of characters between nested HTML comments was divisible by four the page

Re: HTML parsing/scraping python

2005-12-09 Thread alex_f_il
Take a look at SW Explorer Automation (SWEA) (http://home.comcast.net/~furmana/SWIEAutomation.htm). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It supports all IE functionality: frames, JavaScript, dialogs, downloads. The runtime can

Re: HTML parsing/scraping python

2005-12-04 Thread John J. Lee
Sanjay Arora [EMAIL PROTECTED] writes: We are looking to select the language toolset more suitable for a project that requires getting data from several web-sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies, automated

Re: HTML parsing/scraping python

2005-12-04 Thread gene tani
John J. Lee wrote: Sanjay Arora [EMAIL PROTECTED] writes: We are looking to select the language toolset more suitable for a project that requires getting data from several web-sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including

Re: HTML parsing/scraping python

2005-12-01 Thread Fuzzyman
The standard library module for fetching HTML is urllib2. The best module for scraping the HTML is BeautifulSoup. There is a project called mechanize, built by John Lee on top of urllib2 and other standard modules. It will emulate a browser's behaviour - including history, cookies, basic
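For the cookie-handling part, a small sketch using only the standard library is given below. In Python 3 the functionality of `urllib2` lives in `urllib.request`; the function name `make_browserlike_opener` and the default user-agent string are made up for the example, and this covers only a fraction of what mechanize does.

```python
import http.cookiejar
import urllib.request


def make_browserlike_opener(user_agent="example-crawler/0.1"):
    """Build an opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Sent with every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_browserlike_opener()
# A real fetch would then look like this (not executed here):
#     html = opener.open("http://example.com/").read()
```

Any cookies set by a response land in `jar` and are sent back on later requests through the same opener, which is the minimum needed for login-then-browse flows.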

Re: HTML parsing/scraping python

2005-12-01 Thread Mike Meyer
Fuzzyman [EMAIL PROTECTED] writes: The standard library module for fetching HTML is urllib2. Does urllib2 replace everything in urllib? I thought there was some urllib functionality that urllib2 didn't do. There is a project called mechanize, built by John Lee on top of urllib2 and other

HTML parsing/scraping python

2005-11-30 Thread Sanjay Arora
We are looking to select the language toolset more suitable for a project that requires getting data from several web-sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies, automated logins following multiple web-link paths

Re: HTML parsing/scraping python

2005-11-30 Thread Mike Meyer
Sanjay Arora [EMAIL PROTECTED] writes: We are looking to select the language toolset more suitable for a project that requires getting data from several web-sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies, automated

html parsing

2005-03-13 Thread Suchitra
Hi all, Please help me in parsing the HTML document and extracting the http links. Thanks in advance!! Suchitra
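A minimal sketch of one way to do this with the standard library's `html.parser` (Python 3). The names `LinkCollector` and `extract_links` are illustrative, and the choice to keep only absolute `http://` and `https://` links from `<a href>` attributes is an assumption about what "the http links" means.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs found in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Attribute names arrive lowercased; values may be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Relative links such as `/about` are skipped here; resolving them against the page URL with `urllib.parse.urljoin` would be the natural next step.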