Re: subprocess.popen how wait complete open process

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 13:41, Dan Stromberg  wrote:
>
>
>
> On Sun, Aug 21, 2022 at 2:05 PM Chris Angelico  wrote:
>>
>> On Mon, 22 Aug 2022 at 05:39, simone zambonardi
>>  wrote:
>> >
>> > Hi, I am running a program with the punishment subprocess.Popen(...) what I 
>> > should do is to stop the script until the launched program is fully open. 
>> > How can I do this? I used a time.sleep() function but I think there are 
>> > other ways. Thanks
>> >
>>
>> First you have to define "fully open". How would you know?
>
>
> If you're on X11, you could conceivably use:
>  xwininfo -tree -root
>

That's only one possible definition: it has some sort of window. But
to wait until a program is "fully open", you might have to wait past a
splash screen until it has its actual application window. Or maybe
even then, it's not ready for operation. Only the OP can know what
defines "fully open".
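
Once a definition of "fully open" exists, it can be polled instead of
guessing a time.sleep() duration. The following is a minimal sketch, not
a standard recipe: wait_until() and launch_and_wait() are hypothetical
helper names, and is_ready is whatever check the OP decides on (a window
exists, a port accepts connections, a log line has appeared, ...).

```python
import subprocess
import sys
import time


def wait_until(is_ready, timeout=30.0, interval=0.1):
    """Poll is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return is_ready()  # one last check at the deadline


def launch_and_wait(argv, is_ready, timeout=30.0):
    """Start argv with Popen and block until is_ready() reports success."""
    proc = subprocess.Popen(argv)
    # Stop polling early if the child exits before it ever becomes ready.
    wait_until(lambda: is_ready() or proc.poll() is not None, timeout)
    if is_ready():
        return proc
    if proc.poll() is not None:
        raise RuntimeError(f"process exited with code {proc.returncode}")
    raise TimeoutError(f"not ready after {timeout} seconds")
```

The early-exit check only avoids waiting the full timeout for a program
that crashed on startup; the readiness test itself still has to come
from whoever knows the application.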

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: subprocess.popen how wait complete open process

2022-08-21 Thread Dan Stromberg
On Sun, Aug 21, 2022 at 2:05 PM Chris Angelico  wrote:

> On Mon, 22 Aug 2022 at 05:39, simone zambonardi
>  wrote:
> >
> > Hi, I am running a program with the punishment subprocess.Popen(...) what
> > I should do is to stop the script until the launched program is fully open.
> > How can I do this? I used a time.sleep() function but I think there are
> > other ways. Thanks
> >
>
> First you have to define "fully open". How would you know?
>

If you're on X11, you could conceivably use:
 xwininfo -tree -root
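
A rough sketch of driving that command from Python follows. The helper
names are made up for illustration, and the title check assumes that
each window line in the tree output has the form
0x... "title": ("instance" "class") ..., which is worth verifying
against the output on your own system.

```python
import subprocess


def list_x11_windows():
    """Return the text printed by `xwininfo -tree -root` (requires X11)."""
    result = subprocess.run(
        ["xwininfo", "-tree", "-root"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_window(tree_text, title):
    """True if some line of the tree output names a window with this title."""
    # Assumption: window names are printed in double quotes followed by a colon.
    needle = f'"{title}":'
    return any(needle in line for line in tree_text.splitlines())
```

Calling has_window(list_x11_windows(), "My App") in a loop only shows
that a window with that title exists, not that the program behind it is
ready for input.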


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 10:04, Buck Evan  wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently 
> "fixes", I would recommend adding a stutter-step to your project: perform a 
> noop roundtrip thru lxml on all files. I'd then analyze any diff by 
> progressively excluding changes via `grep -vP`.
> Unless I'm mistaken, all such changes should fall into no more than a dozen 
> groups.
>

Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.

ChrisA


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Buck Evan
I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
noop roundtrip thru lxml on all files. I'd then analyze any diff by
progressively excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen
groups.
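
A sketch of that no-op round trip, for illustration only. It assumes
lxml is installed and that lxml.html.document_fromstring() followed by
lxml.html.tostring() is an acceptable "change nothing" pass; the
function names are invented here, and the diff helper accepts any
round-trip callable so it can be tried without lxml.

```python
import difflib


def lxml_roundtrip(html_text):
    """Parse and re-serialize with lxml.html without deliberate changes."""
    import lxml.html  # third-party; imported lazily on purpose

    doc = lxml.html.document_fromstring(html_text)
    return lxml.html.tostring(doc, encoding="unicode")


def roundtrip_diff(html_text, roundtrip=lxml_roundtrip, name="page.html"):
    """Return unified-diff lines between the input and its round trip."""
    after = roundtrip(html_text)
    return list(
        difflib.unified_diff(
            html_text.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=name,
            tofile=name + " (round trip)",
        )
    )
```

An empty list means the file survived untouched; anything else is the
material to bucket with `grep -vP` as described above.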




On Fri, Aug 19, 2022, 1:34 PM Chris Angelico  wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
 wrote:
>
> On 2022-08-21, Chris Angelico  wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> > wrote:
> >> On 2022-08-20, Chris Angelico  wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >> >textual representations.  That way, the following two elements are the
> >> >> >same (and similar with a collection of sub-elements in a different 
> >> >> >order
> >> >> >in another document):
> >> >>
> >> >>   The /elements/ differ. They have the /same/ infoset.
> >> >
> >> > That's the bit that's hard to prove.
> >> >
> >> >>   The OP could edit the files with regexps to create a new version.
> >> >
> >> > To you and Jon, who also suggested this: how would that be beneficial?
> >> > With Beautiful Soup, I have the line number and position within the
> >> > line where the tag starts; what does a regex give me that I don't have
> >> > that way?
> >>
> >> You mean you could use BeautifulSoup to read the file and identify the
> >> bits you want to change by line number and offset, and then you could
> >> use that data to try and update the file, hoping like hell that your
> >> definition of "line" and "offset" are identical to BeautifulSoup's
> >> and that you don't mess up later changes when you do earlier ones (you
> >> could do them in reverse order of line and offset I suppose) and
> >> probably resorting to regexps anyway in order to find the part of the
> >> tag you want to change ...
> >>
> >> ... or you could avoid all that faff and just do re.sub()?
> >
> > Stefan answered in part, but I'll add that it is far FAR easier to do
> > the analysis with BS4 than regular expressions. I'm not sure what
> > "hoping like hell" is supposed to mean here, since the line and offset
> > have been 100% accurate in my experience;
>
> Given the string:
>
> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
>>> raw.split(b"\n")[3][12:]
b'<body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straight-forward string.)
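
To make the 1-based/0-based convention concrete, here is a small sketch
that turns such a pair into an absolute index. absolute_offset() is a
hypothetical helper, and it assumes, as the check above does, that lines
are separated by "\n" only.

```python
def absolute_offset(text, sourceline, sourcepos):
    """Convert a 1-based line and 0-based column into an index into text."""
    lines = text.split("\n")  # "\r" and "\v" do not start a new line here
    if not 1 <= sourceline <= len(lines):
        raise ValueError(f"line {sourceline} is outside the document")
    # Each earlier line contributes its length plus one for its "\n".
    before = sum(len(line) + 1 for line in lines[: sourceline - 1])
    return before + sourcepos
```

Slicing the document at the returned index should yield text that starts
with the tag; if it does not, the two definitions of "line" disagree for
that file.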

And yes, it depends on the parser, but I'm using html.parser and it's fine.

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
> I am happy with the program throwing an exception" then feel free to
> remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.
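
That fallback can be sketched in a few lines. decode_html() is a
hypothetical name, and treating every non-UTF-8 file as ISO-8859-1 is
the assumption stated above, not something the files themselves
guarantee.

```python
def decode_html(raw):
    """Decode bytes as UTF-8, falling back to ISO-8859-1 if that fails."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

Returning the codec name alongside the text lets the caller encode the
edited document the same way it was read, which keeps the diff free of
encoding changes.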

> > the only part I'm unsure about is where the _end_ of the tag is (and
> > maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag. And in fact, I may just do that part by scanning for an
unencoded greater-than, on the assumptions that (a) BS4 will correctly
encode any greater-thans in attributes, and (b) if there's a
mis-encoded one in the input, the diff will be small enough to
eyeball, and a human should easily notice that the text has been
massively expanded and duplicated.
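
As an illustration of replacing just the opening tag, the sketch below
uses the standard library's html.parser directly rather than BS4, which
is a swapped-in technique and not what was run in this thread.
HTMLParser.getpos() gives the same kind of (line, offset) pair, and
get_starttag_text() returns the start tag exactly as written, so no scan
for ">" is needed. AnchorLocator and rewrite_hrefs are invented names,
and the plain str.replace() inside the tag assumes the URL appears
literally in the source (no character references) and does not appear in
an earlier attribute of the same tag.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record where each <a> start tag begins and its exact source text."""

    def __init__(self):
        super().__init__()
        self.found = []  # (line, column, raw_tag_text, attrs)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            line, col = self.getpos()  # line is 1-based, col is 0-based
            self.found.append((line, col, self.get_starttag_text(), attrs))


def rewrite_hrefs(text, old, new):
    """Replace old with new inside <a> start tags whose href equals old."""
    parser = AnchorLocator()
    parser.feed(text)
    parser.close()
    # Absolute index of the first character of every line.
    starts = [0]
    for line in text.split("\n")[:-1]:
        starts.append(starts[-1] + len(line) + 1)
    # Work backwards so earlier offsets stay valid after each splice.
    for line, col, raw_tag, attrs in reversed(parser.found):
        if dict(attrs).get("href") != old:
            continue
        begin = starts[line - 1] + col
        end = begin + len(raw_tag)
        if text[begin:end] != raw_tag:
            raise ValueError(f"position mismatch at line {line}")
        text = text[:begin] + raw_tag.replace(old, new, 1) + text[end:]
    return text
```

Everything outside the matched start tags is copied through untouched,
so attribute order, spacing and quoting survive, and the mismatch check
fails loudly if the parser's idea of line and offset ever differs from
the one used here.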

ChrisA


Re: subprocess.popen how wait complete open process

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 05:39, simone zambonardi
 wrote:
>
> Hi, I am running a program with the punishment subprocess.Popen(...) what I 
> should do is to stop the script until the launched program is fully open. How 
> can I do this? I used a time.sleep() function but I think there are other 
> ways. Thanks
>

First you have to define "fully open". How would you know?

ChrisA


Re: subprocess.popen how wait complete open process

2022-08-21 Thread Paul Bryan
Sometimes, launching subprocesses can seem like punishment. I don't
think there is a standard cross-platform way to know when a launched
asynchronous process is "fully open" (i.e. fully initialized, accepting
user input).

On Sun, 2022-08-21 at 02:11 -0700, simone zambonardi wrote:
> Hi, I am running a program with the punishment subprocess.Popen(...)
> what I should do is to stop the script until the launched program is
> fully open. How can I do this? I used a time.sleep() function but I
> think there are other ways. Thanks



Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico  wrote:
> On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-20, Chris Angelico  wrote:
>> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >textual representations.  That way, the following two elements are the
>> >> >same (and similar with a collection of sub-elements in a different order
>> >> >in another document):
>> >>
>> >>   The /elements/ differ. They have the /same/ infoset.
>> >
>> > That's the bit that's hard to prove.
>> >
>> >>   The OP could edit the files with regexps to create a new version.
>> >
>> > To you and Jon, who also suggested this: how would that be beneficial?
>> > With Beautiful Soup, I have the line number and position within the
>> > line where the tag starts; what does a regex give me that I don't have
>> > that way?
>>
>> You mean you could use BeautifulSoup to read the file and identify the
>> bits you want to change by line number and offset, and then you could
>> use that data to try and update the file, hoping like hell that your
>> definition of "line" and "offset" are identical to BeautifulSoup's
>> and that you don't mess up later changes when you do earlier ones (you
>> could do them in reverse order of line and offset I suppose) and
>> probably resorting to regexps anyway in order to find the part of the
>> tag you want to change ...
>>
>> ... or you could avoid all that faff and just do re.sub()?
>
> Stefan answered in part, but I'll add that it is far FAR easier to do
> the analysis with BS4 than regular expressions. I'm not sure what
> "hoping like hell" is supposed to mean here, since the line and offset
> have been 100% accurate in my experience;

Given the string:

b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"

what is the line number and offset of the question mark - and does
BeautifulSoup agree with your answer? Does the answer to that second
question change depending on what parser you tell BeautifulSoup to use?

(If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
I am happy with the program throwing an exception" then feel free to
remove that substring from the question.)

> the only part I'm unsure about is where the _end_ of the tag is (and
> maybe there's a way I can use BS4 again to get that??).

There doesn't seem to be. More to the point, there doesn't seem to be
a way to find out where the *attributes* are, so as I said you'll most
likely end up using regexps anyway.


subprocess.popen how wait complete open process

2022-08-21 Thread simone zambonardi
Hi, I am running a program with the punishment subprocess.Popen(...). What I
should do is to stop the script until the launched program is fully open. How
can I do this? I used a time.sleep() function but I think there are other ways.
Thanks


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Peter J. Holzer
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram  wrote:
> > Jon Ribbens  writes:
> >>... or you could avoid all that faff and just do re.sub()?

> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.


> You could go a bit harder with the regexp of course, e.g.:
> 
>   result = re.sub(
>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:


The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.
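
A small demonstration of how such a pattern can miss a link. The pattern
is the one quoted above with OLD as a placeholder; the replacement
string and both inputs are assumptions made up for this sketch, the
second one being a hypothetical tag with a literal ">" inside an earlier
attribute value.

```python
import re

# Pattern from the quoted suggestion; the replacement below is an assumption.
PATTERN = re.compile(r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""")


def rewrite(html):
    """Swap OLD for NEW in anchor hrefs, keeping the original quote style."""
    return PATTERN.sub(r"\1\2NEW\2", html)


plain = '<a class="nav" href="OLD">home</a>'
# Hypothetical input: "[^>]*" stops at the ">" inside the title attribute.
tricky = '<a title="a > b" href="OLD">home</a>'

print(rewrite(plain))
print(rewrite(tricky))
```

The first tag is rewritten and the second is left alone without any
error, which is the kind of silent miss that makes a purely regexp-based
pass hard to trust on arbitrary input.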

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry


> On 21 Aug 2022, at 09:12, Chris Angelico  wrote:
> 
> On Sun, 21 Aug 2022 at 17:26, Barry  wrote:
>> 
>> 
>> 
>>>> On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
>>> 
>>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>>>> 
>>>> 
>>>> 
>>>>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>>>>> 
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
>>>> 
>>>> I recall that in bs4 it parses into an object tree and loses the detail of 
>>>> the input.
>>>> I recently ported from very old bs to bs4 and hit the same issue.
>>>> So no it will not output the same as went in.
>>>> 
>>>> If you can trust the input to be parsed as xml, meaning all the rules of 
>>>> closing
>>>> tags have been followed. Then I think you can parse and unparse thru xml to
>>>> do what you want.
>>>> 
>>> 
>>> 
>>> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
>>> well. Thanks for trying, anyhow.
>>> 
>>> So I'm left with a few options:
>>> 
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>> 
>> Can you build a beta site with the original intact?
> 
> In a naive way, a full copy would be quite a few gigabytes. I could
> cut that down a good bit by taking only HTML files and the things they
> reference, but then we run into the same problem of broken links,
> which is what we're here to solve in the first place.
> 
> But I would certainly not want to run two copies of the site and then
> manually compare.
> 
>> Also wonder if using selenium to walk the site may work as a verification 
>> step?
>> I cannot recall if you can get an image of the browser window to do image 
>> compares with to look for rendering differences.
> 
> Image recognition won't necessarily even be valid; some of the changes
> will have visual consequences (eg a broken image reference now
> becoming correct), and as soon as that happens, the whole document can
> reflow.
> 
>> From my one task using bs4 I did not see it produce any bad results.
>> In my case the problems were in the code that built on bs1 using bad 
>> assumptions.
> 
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?

A small number of messy HTML pages.

Barry

> 
> ChrisA


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Sun, 21 Aug 2022 at 17:26, Barry  wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> >>
> >>
> >>
> >>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of 
> >> the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no it will not output the same as went in.
> >>
> >> If you can trust the input to be parsed as xml, meaning all the rules of 
> >> closing
> >> tags have been followed. Then I think you can parse and unparse thru xml to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
>
> Can you build a beta site with the original intact?

In a naive way, a full copy would be quite a few gigabytes. I could
cut that down a good bit by taking only HTML files and the things they
reference, but then we run into the same problem of broken links,
which is what we're here to solve in the first place.

But I would certainly not want to run two copies of the site and then
manually compare.

> Also wonder if using selenium to walk the site may work as a verification 
> step?
> I cannot recall if you can get an image of the browser window to do image 
> compares with to look for rendering differences.

Image recognition won't necessarily even be valid; some of the changes
will have visual consequences (eg a broken image reference now
becoming correct), and as soon as that happens, the whole document can
reflow.

> From my one task using bs4 I did not see it produce any bad results.
> In my case the problems were in the code that built on bs1 using bad 
> assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that
uses quirks mode?

ChrisA


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry


> On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
> 
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>> 
>> 
>> 
>>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>>> 
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>> 
>> I recall that in bs4 it parses into an object tree and loses the detail of 
>> the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no it will not output the same as went in.
>> 
>> If you can trust the input to be parsed as xml, meaning all the rules of 
>> closing
>> tags have been followed. Then I think you can parse and unparse thru xml to
>> do what you want.
>> 
> 
> 
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
> 
> So I'm left with a few options:
> 
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed

Can you build a beta site with the original intact?

I also wonder whether using Selenium to walk the site might work as a
verification step. I cannot recall whether you can capture an image of the
browser window to run image comparisons and look for rendering differences.

From my one task using bs4 I did not see it produce any bad results.
In my case the problems were in the code that built on bs1 using bad
assumptions.



> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
> 
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
> 
> ChrisA