Re: HTML extraction
Pieter van Oostrum wrote at 2021-12-8 11:00 +0100:
> ...
> bs4 can do it, but lxml wants correct XML.

Use `lxml`'s `HTMLParser` to parse HTML (see https://lxml.de/parsing.html#parsing-html).

--
https://mail.python.org/mailman/listinfo/python-list
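As a minimal sketch of the suggestion above, the fragment from the thread can be handed to `etree.fromstring` together with an `etree.HTMLParser` instance. The helper name `paragraph_texts` and the constant `FRAGMENT` are made up for illustration, and the import is guarded only so the sketch does nothing on a machine where the third-party `lxml` package is not installed.

```python
try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None

# The non-XML fragment discussed in the thread: unclosed <p> tags, bare <hr>.
FRAGMENT = "<p>A<p>B<hr>"


def paragraph_texts(markup):
    """Return the text of every <p> element (hypothetical helper)."""
    parser = etree.HTMLParser()  # lenient HTML parser instead of the XML one
    root = etree.fromstring(markup, parser)  # root is the implied <html> element
    return [p.text for p in root.iter("p")]


if etree is not None:
    print(paragraph_texts(FRAGMENT))
```

The only difference from the failing call quoted in the thread is the second argument: without it, `etree.fromstring` uses the strict XML parser.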
Re: HTML extraction
Roland Mueller writes:

> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>
>

bs4 can do it, but lxml wants correct XML.

Jupyter console 6.4.0

Python 3.9.9 (main, Nov 16 2021, 07:21:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bs4 import BeautifulSoup as bs

In [2]: soup = bs('<p>A<p>B<hr>')

In [3]: soup.p
Out[3]: <p>A</p>

In [4]: soup.find_all('p')
Out[4]: [<p>A</p>, <p>B</p>]

In [5]: from lxml import etree

In [6]: root = etree.fromstring('<p>A<p>B<hr>')
Traceback (most recent call last):

  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "/var/folders/2l/pdng2d2x18d00m41l6r2ccjrgn/T/ipykernel_96220/3376613260.py", line 1, in <module>
    root = etree.fromstring('<p>A<p>B<hr>')

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 1
XMLSyntaxError: Premature end of data in tag hr line 1, line 1, column 13

--
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]
Re: HTML extraction
Roland Mueller wrote at 2021-12-7 22:55 +0200:
> ...
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?

`lxml` comes with an HTML parser, which can be configured to check loosely.
Re: HTML extraction
On Wed, Dec 8, 2021 at 7:55 AM Roland Mueller wrote:
>
> Hello,
>
> On Tue, Dec 7, 2021 at 20:08, Chris Angelico (ros...@gmail.com) wrote:
>>
>> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton wrote:
>> >
>> > Hey,
>> >
>> > Could anyone please comment on the purest way simply to strip HTML tags
>> > from the internal text they surround?
>> >
>> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
>> > the most minimal way to do it would be.
>>
>> That's definitely the best and most general way, and would still be my
>> first thought most of the time.
>>
>> > People say you usually don’t use Regex for a second order language like
>> > HTML, so I was thinking about using xpath or lxml, which seem like very
>> > pure, universal tools for the job.
>> >
>> > I did find an example for doing this with the re module, though.
>> >
>> > Would it be fair to say that to just strip the tags, Regex is fine, but you
>> > need to build a tree-like object if you want the ability to select which
>> > nodes to keep and which to discard?
>>
>> Obligatory reference:
>>
>> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>>
>> > Can xpath / lxml do that?
>> >
>> > What are the chief differences between xpath / lxml and Beautiful Soup?
>> >
>>
>> I've never directly used lxml, mainly because bs4 offers all the same
>> advantages and more, with about the same costs. However, if you're
>> looking for a no-external-deps option, Python *does* include an HTML
>> parser in the standard library:
>>
>
> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>
>
> BR,
> Roland
>

Check out the bs4 docs for some of the things you can do with it :)

ChrisA
Re: HTML extraction
Hello,

On Tue, Dec 7, 2021 at 20:08, Chris Angelico (ros...@gmail.com) wrote:

> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton wrote:
> >
> > Hey,
> >
> > Could anyone please comment on the purest way simply to strip HTML tags
> > from the internal text they surround?
> >
> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
> > the most minimal way to do it would be.
>
> That's definitely the best and most general way, and would still be my
> first thought most of the time.
>
> > People say you usually don’t use Regex for a second order language like
> > HTML, so I was thinking about using xpath or lxml, which seem like very
> > pure, universal tools for the job.
> >
> > I did find an example for doing this with the re module, though.
> >
> > Would it be fair to say that to just strip the tags, Regex is fine, but you
> > need to build a tree-like object if you want the ability to select which
> > nodes to keep and which to discard?
>
> Obligatory reference:
>
> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
> > Can xpath / lxml do that?
> >
> > What are the chief differences between xpath / lxml and Beautiful Soup?
> >
>
> I've never directly used lxml, mainly because bs4 offers all the same
> advantages and more, with about the same costs. However, if you're
> looking for a no-external-deps option, Python *does* include an HTML
> parser in the standard library:
>

But isn't bs4 only for SOAP content?
Can bs4 or lxml cope with HTML code that does not comply with XML as the
following fragment?

<p>A
<p>B
<hr>

BR,
Roland

> https://docs.python.org/3/library/html.parser.html
>
> If your purpose is extremely simple (like "strip tags, search for
> text"), then it should be easy enough to whip up something using that
> module. No external deps, not a lot of code, pretty straight-forward.
>
> On the other hand, if you're trying to do an "HTML to text"
> conversion, you'd probably need to be aware of which tags are
> block-level and which are inline content, so that (for instance)
> "<p>Hello</p> <p>world</p>" would come out as two separate
> paragraphs of text, whereas the same thing with inline tags would become
> just "Hello world". But for the most part, handle_data will probably
> do everything you need.
>
> ChrisA
Re: HTML extraction
On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton wrote:
>
> Hey,
>
> Could anyone please comment on the purest way simply to strip HTML tags
> from the internal text they surround?
>
> I know Beautiful Soup is a convenient tool, but I’m interested to know what
> the most minimal way to do it would be.

That's definitely the best and most general way, and would still be my
first thought most of the time.

> People say you usually don’t use Regex for a second order language like
> HTML, so I was thinking about using xpath or lxml, which seem like very
> pure, universal tools for the job.
>
> I did find an example for doing this with the re module, though.
>
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?

Obligatory reference:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

> Can xpath / lxml do that?
>
> What are the chief differences between xpath / lxml and Beautiful Soup?
>

I've never directly used lxml, mainly because bs4 offers all the same
advantages and more, with about the same costs. However, if you're
looking for a no-external-deps option, Python *does* include an HTML
parser in the standard library:

https://docs.python.org/3/library/html.parser.html

If your purpose is extremely simple (like "strip tags, search for
text"), then it should be easy enough to whip up something using that
module. No external deps, not a lot of code, pretty straight-forward.

On the other hand, if you're trying to do an "HTML to text"
conversion, you'd probably need to be aware of which tags are
block-level and which are inline content, so that (for instance)
"<p>Hello</p> <p>world</p>" would come out as two separate
paragraphs of text, whereas the same thing with inline tags would become
just "Hello world". But for the most part, handle_data will probably
do everything you need.

ChrisA
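One possible shape for the standard-library approach described in this message, as a rough sketch only. It subclasses `html.parser.HTMLParser`, collects text in `handle_data`, and inserts a line break around a small set of block-level tags. The names `TextExtractor`, `html_to_text` and `BLOCK_TAGS` are invented for this example, and the tag set is an assumption that is far from complete.

```python
from html.parser import HTMLParser

# Assumption: a deliberately small, incomplete set of block-level tags.
BLOCK_TAGS = {"p", "div", "br", "hr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text content and drop the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A block-level tag starts a new line; inline tags are ignored.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

    def text(self):
        # Trim each line and drop the empty ones left by adjacent block tags.
        lines = (line.strip() for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>Hello</p> <p>world</p>"))
print(html_to_text("<b>Hello</b> <i>world</i>"))
```

Dropping the two tag handlers leaves the plain "strip tags" case, where `handle_data` alone does the work. Note that this sketch also keeps the contents of `<script>` and `<style>` elements, which a real converter would probably want to skip.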
HTML extraction
Hey,

Could anyone please comment on the purest way simply to strip HTML tags
from the internal text they surround?

I know Beautiful Soup is a convenient tool, but I’m interested to know what
the most minimal way to do it would be.

People say you usually don’t use Regex for a second order language like
HTML, so I was thinking about using xpath or lxml, which seem like very
pure, universal tools for the job.

I did find an example for doing this with the re module, though.

Would it be fair to say that to just strip the tags, Regex is fine, but you
need to build a tree-like object if you want the ability to select which
nodes to keep and which to discard?

Can xpath / lxml do that?

What are the chief differences between xpath / lxml and Beautiful Soup?

Thanks,
Julius
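To make the regex question concrete, here is a small sketch of the kind of pattern usually seen in `re`-based tag strippers, together with two inputs where it goes wrong. The pattern and the name `strip_tags_regex` are assumptions for illustration; they are not taken from the example the poster found.

```python
import re

# Naive pattern: "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(markup):
    """Remove everything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub("", markup)


simple = "<p>Hello <b>world</b></p>"
attr_with_gt = '<a title="1 > 0">link</a>'   # ">" inside a quoted attribute
comment_with_gt = "<!-- a > b -->text"        # ">" inside a comment

for sample in (simple, attr_with_gt, comment_with_gt):
    print(repr(strip_tags_regex(sample)))
```

The first input comes out clean, while the other two leave fragments of markup behind because the pattern stops at the first `>` it sees. This is the usual argument for handing real-world HTML to a parser instead.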