Re: HTML extraction

2021-12-09 Thread Dieter Maurer
Pieter van Oostrum wrote at 2021-12-8 11:00 +0100:
> ...
>bs4 can do it, but lxml wants correct XML.

Use `lxml`'s `HTMLParser` to parse HTML
(see https://lxml.de/parsing.html#parsing-html).
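
As a rough illustration of that suggestion (a sketch only, assuming lxml is installed; the variable names are illustrative), passing an `etree.HTMLParser` instance to `etree.fromstring` should let the non-XML fragment from this thread parse instead of raising XMLSyntaxError:

```python
from lxml import etree

# The non-XML fragment discussed in this thread: unclosed <p> and <hr> tags.
fragment = "<p>A<p>B<hr>"

# recover=True is already the default for HTMLParser; it is spelled out here
# only to make the lenient behaviour explicit.
parser = etree.HTMLParser(recover=True)
root = etree.fromstring(fragment, parser)

# The HTML parser wraps the fragment in <html><body> and closes the open tags.
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)

# Tag stripping: join every text node in document order.
text = "".join(root.itertext())
print(text)
```

The same parser object can also be passed to `etree.parse` when reading from a file, so the lenient behaviour is a property of the parser rather than of the calling function.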
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: HTML extraction

2021-12-08 Thread Pieter van Oostrum
Roland Mueller  writes:

> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>
>

bs4 can do it, but lxml wants correct XML.

Jupyter console 6.4.0

Python 3.9.9 (main, Nov 16 2021, 07:21:43) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bs4 import BeautifulSoup as bs

In [2]: soup = bs('<p>A<p>B<hr>')

In [3]: soup.p
Out[3]: <p>A</p>

In [4]: soup.find_all('p')
Out[4]: [<p>A</p>, <p>B</p>]

In [5]: from lxml import etree

In [6]: root = etree.fromstring('<p>A<p>B<hr>')
Traceback (most recent call last):

  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "/var/folders/2l/pdng2d2x18d00m41l6r2ccjrgn/T/ipykernel_96220/3376613260.py", line 1, in <module>
    root = etree.fromstring('<p>A<p>B<hr>')

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 1
XMLSyntaxError: Premature end of data in tag hr line 1, line 1, column 13
-- 
Pieter van Oostrum 
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]


Re: HTML extraction

2021-12-08 Thread Dieter Maurer
Roland Mueller wrote at 2021-12-7 22:55 +0200:
> ...
>Can bs4 or lxml cope with HTML code that does not comply with XML as the
>following fragment?

`lxml` comes with an HTML parser that can be configured to parse leniently.


Re: HTML extraction

2021-12-07 Thread Chris Angelico
On Wed, Dec 8, 2021 at 7:55 AM Roland Mueller
 wrote:
>
> Hello,
>
> ti 7. jouluk. 2021 klo 20.08 Chris Angelico (ros...@gmail.com) kirjoitti:
>>
>> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
>>  wrote:
>> >
>> > Hey,
>> >
>> > Could anyone please comment on the purest way simply to strip HTML tags
>> > from the internal text they surround?
>> >
>> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
>> > the most minimal way to do it would be.
>>
>> That's definitely the best and most general way, and would still be my
>> first thought most of the time.
>>
>> > People say you usually don’t use Regex for a second order language like
>> > HTML, so I was thinking about using xpath or lxml, which seem like very
>> > pure, universal tools for the job.
>> >
>> > I did find an example for doing this with the re module, though.
>> >
>> > Would it be fair to say that to just strip the tags, Regex is fine, but you
>> > need to build a tree-like object if you want the ability to select which
>> > nodes to keep and which to discard?
>>
>> Obligatory reference:
>>
>> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>>
>> > Can xpath / lxml do that?
>> >
>> > What are the chief differences between xpath / lxml and Beautiful Soup?
>> >
>>
>> I've never directly used lxml, mainly because bs4 offers all the same
>> advantages and more, with about the same costs. However, if you're
>> looking for a no-external-deps option, Python *does* include an HTML
>> parser in the standard library:
>>
>
> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the 
> following fragment?
>
> <p>A
> <p>B
> <hr>
>
> BR,
> Roland
>

Check out the bs4 docs for some of the things you can do with it :)

ChrisA


Re: HTML extraction

2021-12-07 Thread Roland Mueller via Python-list
Hello,

ti 7. jouluk. 2021 klo 20.08 Chris Angelico (ros...@gmail.com) kirjoitti:

> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
>  wrote:
> >
> > Hey,
> >
> > Could anyone please comment on the purest way simply to strip HTML tags
> > from the internal text they surround?
> >
> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
> > the most minimal way to do it would be.
>
> That's definitely the best and most general way, and would still be my
> first thought most of the time.
>
> > People say you usually don’t use Regex for a second order language like
> > HTML, so I was thinking about using xpath or lxml, which seem like very
> > pure, universal tools for the job.
> >
> > I did find an example for doing this with the re module, though.
> >
> > Would it be fair to say that to just strip the tags, Regex is fine, but you
> > need to build a tree-like object if you want the ability to select which
> > nodes to keep and which to discard?
>
> Obligatory reference:
>
>
> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
> > Can xpath / lxml do that?
> >
> > What are the chief differences between xpath / lxml and Beautiful Soup?
> >
>
> I've never directly used lxml, mainly because bs4 offers all the same
> advantages and more, with about the same costs. However, if you're
> looking for a no-external-deps option, Python *does* include an HTML
> parser in the standard library:
>
>
But isn't bs4 only for SOAP content?
Can bs4 or lxml cope with HTML code that does not comply with XML as the
following fragment?

<p>A
<p>B
<hr>

BR,
Roland


> https://docs.python.org/3/library/html.parser.html
>
> If your purpose is extremely simple (like "strip tags, search for
> text"), then it should be easy enough to whip up something using that
> module. No external deps, not a lot of code, pretty straight-forward.
> On the other hand, if you're trying to do an "HTML to text"
> conversion, you'd probably need to be aware of which tags are
> block-level and which are inline content, so that (for instance)
> "<p>Hello</p> <p>world</p>" would come out as two separate
> paragraphs of text, whereas the same thing with inline tags would become
> just "Hello world". But for the most part, handle_data will probably
> do everything you need.
>
> ChrisA


Re: HTML extraction

2021-12-07 Thread Chris Angelico
On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
 wrote:
>
> Hey,
>
> Could anyone please comment on the purest way simply to strip HTML tags
> from the internal text they surround?
>
> I know Beautiful Soup is a convenient tool, but I’m interested to know what
> the most minimal way to do it would be.

That's definitely the best and most general way, and would still be my
first thought most of the time.

> People say you usually don’t use Regex for a second order language like
> HTML, so I was thinking about using xpath or lxml, which seem like very
> pure, universal tools for the job.
>
> I did find an example for doing this with the re module, though.
>
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?

Obligatory reference:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
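
To make the concern concrete, here is a small sketch of where a regex can go wrong. The pattern below is a commonly seen naive one, chosen for illustration; it is not necessarily the example Julius found, and the sample markup and class name are made up. It compares the regex against the standard library's html.parser on an attribute value that contains ">":

```python
import re
from html.parser import HTMLParser

# Naive pattern often suggested for tag stripping (an illustrative assumption,
# not the example referred to in the original question).
NAIVE_TAG = re.compile(r"<[^>]+>")

sample = '<a title="a > b">link</a>'

# The pattern stops at the first ">", which here sits inside an attribute value.
naive_text = NAIVE_TAG.sub("", sample)


class TextOnly(HTMLParser):
    """Collect text nodes; tags and their attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


parser = TextOnly()
parser.feed(sample)
parser.close()
parsed_text = "".join(parser.parts)

print(repr(naive_text))   # part of the attribute leaks into the text
print(repr(parsed_text))  # only the link text remains
```

Comments and script content can trip up a regex in a similar way, which is the usual reason to reach for a real parser even when the goal is only to strip tags.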

> Can xpath / lxml do that?
>
> What are the chief differences between xpath / lxml and Beautiful Soup?
>

I've never directly used lxml, mainly because bs4 offers all the same
advantages and more, with about the same costs. However, if you're
looking for a no-external-deps option, Python *does* include an HTML
parser in the standard library:

https://docs.python.org/3/library/html.parser.html

If your purpose is extremely simple (like "strip tags, search for
text"), then it should be easy enough to whip up something using that
module. No external deps, not a lot of code, pretty straight-forward.
On the other hand, if you're trying to do an "HTML to text"
conversion, you'd probably need to be aware of which tags are
block-level and which are inline content, so that (for instance)
"<p>Hello</p> <p>world</p>" would come out as two separate
paragraphs of text, whereas the same thing with inline tags would become
just "Hello world". But for the most part, handle_data will probably
do everything you need.
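
A minimal sketch of that idea, using only html.parser from the standard library. The BLOCK_TAGS set, the TextExtractor class, and the html_to_text helper are hypothetical names introduced for this example, and the list of block-level tags is deliberately incomplete:

```python
from html.parser import HTMLParser

# A small, deliberately incomplete set of block-level tags (an assumption for
# this sketch; a real converter would need a fuller list).
BLOCK_TAGS = {"p", "div", "br", "hr", "li", "h1", "h2", "h3", "pre", "blockquote"}


class TextExtractor(HTMLParser):
    """Collect text via handle_data, breaking lines around block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Split on the line breaks recorded above and drop empty lines.
        lines = "".join(self.parts).split("\n")
        return "\n".join(line.strip() for line in lines if line.strip())


def html_to_text(markup):
    """Hypothetical helper: feed markup to a fresh extractor, return its text."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return extractor.text()


print(html_to_text("<p>Hello</p> <p>world</p>"))              # two lines
print(html_to_text("<span>Hello</span> <span>world</span>"))  # one line
```

If plain tag stripping is all that is needed, the two tag handlers can be dropped and handle_data alone does the work.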

ChrisA


HTML extraction

2021-12-07 Thread Julius Hamilton
Hey,

Could anyone please comment on the purest way simply to strip HTML tags
from the internal text they surround?

I know Beautiful Soup is a convenient tool, but I’m interested to know what
the most minimal way to do it would be.

People say you usually don’t use Regex for a second order language like
HTML, so I was thinking about using xpath or lxml, which seem like very
pure, universal tools for the job.

I did find an example for doing this with the re module, though.

Would it be fair to say that to just strip the tags, Regex is fine, but you
need to build a tree-like object if you want the ability to select which
nodes to keep and which to discard?

Can xpath / lxml do that?

What are the chief differences between xpath / lxml and Beautiful Soup?

Thanks,
Julius