Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-21 Thread Niels Dossche
Hi Dennis

On 21/09/2023 22:26, Dennis Snell wrote:
> 
> 
>> On Sep 19, 2023, at 12:30 AM, Tim Düsterhus  wrote:
>>
>> From the perspective of the user of the API, I like the symmetry between all 
>> the named constructors:
>>
>> Whenever I want to create a new document, I use one of the fromXyz() 
>> methods. And when I use those, I get exactly what it says on the tin.
>>
>> Making the regular constructor available for use would effectively make 
>> whatever that constructor does a special case / the default case.
> 
> 
> Just wanted to voice my support for this too. In the WordPress HTML API I 
> regret exposing a public constructor on the underlying lexer because that 
> prevented me from hiding the constructor on the HTML parsing class, leading 
> to an awkward convention to try and beg people not to use it.
> 
> Why? the constructor performs various state initialization important for 
> HTML, not only for the encoding, but in our case I felt it was important 
> enough that we provide both a full-document interface and also the HTML 
> fragment parser, since almost all of the actual HTML parsing we do is one 
> fragments. This _could_ be something to consider here in the HTMLDocument API 
> as well.
> 
> The use of the constructor directly opens up opportunities to bypass that 
> state initialization which could lead to incorrect parse results, not only 
> for encoding issues, but also in getting way off track in parsing. E.g. if 
> we’re parsing the inner contents of a TEMPLATE element then a closing BODY 
> tag should be ignored whereas in other parsing modes it closes the BODY 
> element.
> 
>> For the HtmlDocument I find it less obvious that creating an empty document 
>> would be the default case compared to loading an existing document from a 
>> file.
> 
> 
> To hit the point home, “creating an empty document” seems a little unclear in 
> terms of HTML as well. Are we creating am empty DOM class instance without 
> any actual DOM? Are we creating a  DOM with no nodes? If we 
> proceed by feeding it an HTML string and parsing then we can resolve the 
> ambiguity, but if we start inserting DOM nodes directly that leaves us in an 
> undefined starting point.
> 
> Presumably we don’t need this empty method because we could `loadString( ‘’ 
> )`, but that leaves open the ambiguities of what kind of empty we’re making. 
> In my experience with HTML, any kind of “empty” is going to need a definition 
> of its own, and a static creator method is the right place to define and 
> parameterize that, as in the updated RFC, though maybe with an additional 
> clarification of what “empty” means and what implications it carries for 
> ongoing manipulation.
> 

To clarify what createEmpty() does and how it differs from createFromString(''):
createEmpty() creates a HTMLDocument whose DOM tree is empty (i.e. no nodes): 
no html element and not even a doctype. It is the replacement for the 
DOMDocument constructor. This is intended for people who create everything 
themselves from scratch, or people trying to construct only fragments. This 
does not invoke the parser and its rules.
createFromString('') will not produce an empty DOM tree. That's because the 
parsing rules tell us that the html, head and body element are always created. 
This means the resulting DOM document will look like this: 
.

> —
> 
> Having static creator methods with docblocks and named arguments to me is a 
> great way to not only guide people to use these classes safely, but also to 
> teach some of the nuances that have historically been overlooked in PHP’s 
> HTML handling, e.g. `html_entity_decode()` is unaware of of the ambiguous 
> ampersand and offers no way to correctly parse certain kinds of invalid named 
> character references.
> 

I agree.

> Kindly,
> Dennis Snell

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-21 Thread Dennis Snell via internals



> On Sep 19, 2023, at 12:30 AM, Tim Düsterhus  wrote:
> 
> From the perspective of the user of the API, I like the symmetry between all 
> the named constructors:
> 
> Whenever I want to create a new document, I use one of the fromXyz() methods. 
> And when I use those, I get exactly what it says on the tin.
> 
> Making the regular constructor available for use would effectively make 
> whatever that constructor does a special case / the default case.


Just wanted to voice my support for this too. In the WordPress HTML API I 
regret exposing a public constructor on the underlying lexer because that 
prevented me from hiding the constructor on the HTML parsing class, leading to 
an awkward convention to try and beg people not to use it.

Why? the constructor performs various state initialization important for HTML, 
not only for the encoding, but in our case I felt it was important enough that 
we provide both a full-document interface and also the HTML fragment parser, 
since almost all of the actual HTML parsing we do is one fragments. This 
_could_ be something to consider here in the HTMLDocument API as well.

The use of the constructor directly opens up opportunities to bypass that state 
initialization which could lead to incorrect parse results, not only for 
encoding issues, but also in getting way off track in parsing. E.g. if we’re 
parsing the inner contents of a TEMPLATE element then a closing BODY tag should 
be ignored whereas in other parsing modes it closes the BODY element.

> For the HtmlDocument I find it less obvious that creating an empty document 
> would be the default case compared to loading an existing document from a 
> file.


To hit the point home, “creating an empty document” seems a little unclear in 
terms of HTML as well. Are we creating am empty DOM class instance without any 
actual DOM? Are we creating a  DOM with no nodes? If we proceed 
by feeding it an HTML string and parsing then we can resolve the ambiguity, but 
if we start inserting DOM nodes directly that leaves us in an undefined 
starting point.

Presumably we don’t need this empty method because we could `loadString( ‘’ )`, 
but that leaves open the ambiguities of what kind of empty we’re making. In my 
experience with HTML, any kind of “empty” is going to need a definition of its 
own, and a static creator method is the right place to define and parameterize 
that, as in the updated RFC, though maybe with an additional clarification of 
what “empty” means and what implications it carries for ongoing manipulation.

—

Having static creator methods with docblocks and named arguments to me is a 
great way to not only guide people to use these classes safely, but also to 
teach some of the nuances that have historically been overlooked in PHP’s HTML 
handling, e.g. `html_entity_decode()` is unaware of of the ambiguous ampersand 
and offers no way to correctly parse certain kinds of invalid named character 
references.

Kindly,
Dennis Snell
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-21 Thread Niels Dossche
Hi Stephen

On 20/09/2023 12:02, Stephen Reay wrote:
> 
> 
>> On 20 Sep 2023, at 03:03, Niels Dossche  wrote:
>>
>> Hi Stephen
>>
>> On 19/09/2023 09:58, Stephen Reay wrote:
>>>
>>>
 On 19 Sep 2023, at 14:30, Tim Düsterhus  wrote:

 Hi

 On 9/19/23 08:35, Stephen Reay wrote:
> Regarding the private constructor: I understand the issue with the *old* 
> class being confusing - but your new class doesn't have that issue, 
> because there are no "loadXXX" methods: as you said, if you're loading an 
> existing document, you're forced to do it via the static factory methods. 
> With that change alone, there's zero risk of confusion in allowing direct 
> constructor calls, because once the object is instantiated there is no 
> subsequent way to load a document and inadvertently change the encoding.
> Having a regular constructor and then one or more factory methods for 
> specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
> `createFromXXX` as well as a constructor), and IMO needing to call a 
> factory method to get an "empty" document seems out of place in PHP - it 
> seems a bit like a Java-ism - using a factory, where one just isn't 
> required.

 I was one of the persons who discussed this updated API with Niels in 
 private and looking up the discussion it was me who proposed making the 
 constructor private and just providing named constructors.

 From the perspective of the user of the API, I like the symmetry between 
 all the named constructors:

 Whenever I want to create a new document, I use one of the fromXyz() 
 methods. And when I use those, I get exactly what it says on the tin.

 Making the regular constructor available for use would effectively make 
 whatever that constructor does a special case / the default case. This 
 makes sense for DateTimeImmutable, because the __construct() variant is 
 likely much more often used than the various createFromXyz() variants. For 
 the HtmlDocument I find it less obvious that creating an empty document 
 would be the default case compared to loading an existing document from a 
 file.

 We should probably rename the named constructors to include the "create" 
 prefix for consistency with DTI though.

 Best regards
 Tim Düsterhus

>>>
>>> Hi Tim,
>>>
>>>
>>> Thanks for providing context on where this idea came from.
>>>
>>>
>>> The constructor + specialised factory methods pattern is not exactly new in 
>>> PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing 
>>> this), and disabling the public constructor for purely cosmetic reasons 
>>> sounds like a very weird, and ironic choice to me, when the stated goal is 
>>> to make the API *less surprising* than the previous one.
>>>
>>
>> Besides the points that have been mentioned by Tim and Larry, there's also 
>> the expectation of the programmer that migrates to the new classes to take 
>> into account.
>> They're used to calling the constructor on the old class and calling a load 
>> method afterwards.
>> As there is still a constructor, but no load method, this might be confusing 
>> for them.
>> So in my opinion, disabling it makes it less surprising than the previous 
>> one.
>> Also, just because Symfony does this doesn't mean it's automatically 
>> something we should follow.
>>
>>>
>>> Cheers
>>>
>>> Stephen 
>>
>> Kind regards
>> Niels
>>
>> -- 
>> PHP Internals - PHP Runtime Development Mailing List
>> To unsubscribe, visit: https://www.php.net/unsub.php 
>> 
> 
> Hi Niels, Larry
> 
> To clarify my earlier message about having the factory methods on the 
> faux-interface - when I said meta programming that was probably the wrong 
> term to use, so I apologise for the confusion. 
> It's true that the *developer* needs to know the expected type of the string 
> being parsed. It's not true that the code directly invoking the 'fromString' 
> method necessarily needs to know - passing classnames as a string and calling 
> static methods on the string is also an established concept in PHP, which 
> works really well when the method being called is defined on a single 
> top-level class or interface.
> A real world example I had in mind, is a Request (e.g. a curl request, or 
> some other http-fetching) class that can decode responses into native PHP 
> objects, with options for the callee to define the object type that's wanted; 
> with just the two provided classes there's already double the checks required 
> to know if the specified class name is one the library knows how to use. If a 
> hypothetical class to handle SVG, RSS, ODF (or any other specialised use of 
> XML) class uses the faux-interface from this RFC as it's base, there's no way 
> for the Request class to know that it can instantiate an instance from a 
> string, without falling back to 

Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-20 Thread Stephen Reay


> On 20 Sep 2023, at 03:03, Niels Dossche  wrote:
> 
> Hi Stephen
> 
> On 19/09/2023 09:58, Stephen Reay wrote:
>> 
>> 
>>> On 19 Sep 2023, at 14:30, Tim Düsterhus  wrote:
>>> 
>>> Hi
>>> 
>>> On 9/19/23 08:35, Stephen Reay wrote:
 Regarding the private constructor: I understand the issue with the *old* 
 class being confusing - but your new class doesn't have that issue, 
 because there are no "loadXXX" methods: as you said, if you're loading an 
 existing document, you're forced to do it via the static factory methods. 
 With that change alone, there's zero risk of confusion in allowing direct 
 constructor calls, because once the object is instantiated there is no 
 subsequent way to load a document and inadvertently change the encoding.
 Having a regular constructor and then one or more factory methods for 
 specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
 `createFromXXX` as well as a constructor), and IMO needing to call a 
 factory method to get an "empty" document seems out of place in PHP - it 
 seems a bit like a Java-ism - using a factory, where one just isn't 
 required.
>>> 
>>> I was one of the persons who discussed this updated API with Niels in 
>>> private and looking up the discussion it was me who proposed making the 
>>> constructor private and just providing named constructors.
>>> 
>>> From the perspective of the user of the API, I like the symmetry between 
>>> all the named constructors:
>>> 
>>> Whenever I want to create a new document, I use one of the fromXyz() 
>>> methods. And when I use those, I get exactly what it says on the tin.
>>> 
>>> Making the regular constructor available for use would effectively make 
>>> whatever that constructor does a special case / the default case. This 
>>> makes sense for DateTimeImmutable, because the __construct() variant is 
>>> likely much more often used than the various createFromXyz() variants. For 
>>> the HtmlDocument I find it less obvious that creating an empty document 
>>> would be the default case compared to loading an existing document from a 
>>> file.
>>> 
>>> We should probably rename the named constructors to include the "create" 
>>> prefix for consistency with DTI though.
>>> 
>>> Best regards
>>> Tim Düsterhus
>>> 
>> 
>> Hi Tim,
>> 
>> 
>> Thanks for providing context on where this idea came from.
>> 
>> 
>> The constructor + specialised factory methods pattern is not exactly new in 
>> PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing 
>> this), and disabling the public constructor for purely cosmetic reasons 
>> sounds like a very weird, and ironic choice to me, when the stated goal is 
>> to make the API *less surprising* than the previous one.
>> 
> 
> Besides the points that have been mentioned by Tim and Larry, there's also 
> the expectation of the programmer that migrates to the new classes to take 
> into account.
> They're used to calling the constructor on the old class and calling a load 
> method afterwards.
> As there is still a constructor, but no load method, this might be confusing 
> for them.
> So in my opinion, disabling it makes it less surprising than the previous one.
> Also, just because Symfony does this doesn't mean it's automatically 
> something we should follow.
> 
>> 
>> Cheers
>> 
>> Stephen 
> 
> Kind regards
> Niels
> 
> -- 
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php

Hi Niels, Larry

To clarify my earlier message about having the factory methods on the 
faux-interface - when I said meta programming that was probably the wrong term 
to use, so I apologise for the confusion. 
It's true that the *developer* needs to know the expected type of the string 
being parsed. It's not true that the code directly invoking the 'fromString' 
method necessarily needs to know - passing classnames as a string and calling 
static methods on the string is also an established concept in PHP, which works 
really well when the method being called is defined on a single top-level class 
or interface.
A real world example I had in mind, is a Request (e.g. a curl request, or some 
other http-fetching) class that can decode responses into native PHP objects, 
with options for the callee to define the object type that's wanted; with just 
the two provided classes there's already double the checks required to know if 
the specified class name is one the library knows how to use. If a hypothetical 
class to handle SVG, RSS, ODF (or any other specialised use of XML) class uses 
the faux-interface from this RFC as it's base, there's no way for the Request 
class to know that it can instantiate an instance from a string, without 
falling back to method_exists() checks.


The only reason I mentioned the DateTime or Symphony classes at all, is to 
highlight that people have been using classes with a regular constructor, and 
one or more static factory 

Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Niels Dossche
Hi Stephen

On 19/09/2023 09:58, Stephen Reay wrote:
> 
> 
>> On 19 Sep 2023, at 14:30, Tim Düsterhus  wrote:
>>
>> Hi
>>
>> On 9/19/23 08:35, Stephen Reay wrote:
>>> Regarding the private constructor: I understand the issue with the *old* 
>>> class being confusing - but your new class doesn't have that issue, because 
>>> there are no "loadXXX" methods: as you said, if you're loading an existing 
>>> document, you're forced to do it via the static factory methods. With that 
>>> change alone, there's zero risk of confusion in allowing direct constructor 
>>> calls, because once the object is instantiated there is no subsequent way 
>>> to load a document and inadvertently change the encoding.
>>> Having a regular constructor and then one or more factory methods for 
>>> specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
>>> `createFromXXX` as well as a constructor), and IMO needing to call a 
>>> factory method to get an "empty" document seems out of place in PHP - it 
>>> seems a bit like a Java-ism - using a factory, where one just isn't 
>>> required.
>>
>> I was one of the persons who discussed this updated API with Niels in 
>> private and looking up the discussion it was me who proposed making the 
>> constructor private and just providing named constructors.
>>
>> From the perspective of the user of the API, I like the symmetry between all 
>> the named constructors:
>>
>> Whenever I want to create a new document, I use one of the fromXyz() 
>> methods. And when I use those, I get exactly what it says on the tin.
>>
>> Making the regular constructor available for use would effectively make 
>> whatever that constructor does a special case / the default case. This makes 
>> sense for DateTimeImmutable, because the __construct() variant is likely 
>> much more often used than the various createFromXyz() variants. For the 
>> HtmlDocument I find it less obvious that creating an empty document would be 
>> the default case compared to loading an existing document from a file.
>>
>> We should probably rename the named constructors to include the "create" 
>> prefix for consistency with DTI though.
>>
>> Best regards
>> Tim Düsterhus
>>
> 
> Hi Tim,
> 
> 
> Thanks for providing context on where this idea came from.
> 
> 
> The constructor + specialised factory methods pattern is not exactly new in 
> PHP (e.g. it took about 3 minutes to find multiple Symphony classes doing 
> this), and disabling the public constructor for purely cosmetic reasons 
> sounds like a very weird, and ironic choice to me, when the stated goal is to 
> make the API *less surprising* than the previous one.
> 

Besides the points that have been mentioned by Tim and Larry, there's also the 
expectation of the programmer that migrates to the new classes to take into 
account.
They're used to calling the constructor on the old class and calling a load 
method afterwards.
As there is still a constructor, but no load method, this might be confusing 
for them.
So in my opinion, disabling it makes it less surprising than the previous one.
Also, just because Symfony does this doesn't mean it's automatically something 
we should follow.

> 
> Cheers
> 
> Stephen 

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Niels Dossche
Hi Stephen

On 19/09/2023 08:35, Stephen Reay wrote:
> 
>> On 19 Sep 2023, at 01:00, Niels Dossche 
>>
>> Cheers
>> Niels
>>
>> -- 
>> PHP Internals - PHP Runtime Development Mailing List
>> To unsubscribe, visit: https://www.php.net/unsub.php
> 
> 
> 
> (Resending with history removed due to apparent size limit) 
> 
> Hi Niels,
> 
> Obviously, a method with different signatures (`fromEmptyDocument`) can't be 
> in the parent, and I'll discuss that below, but having the other factory 
> methods that do have the same signature in parent (even if only an abstract 
> method, forcing concrete implementations to define it) means that it's 
> possible to *safely* do meta-programming with these classes, where you load a 
> string/file, using a classname from a variable. This use-case doesn't 
> necessarily need to know which specific instance it's getting back, so long 
> as it's an instance of the base class (or interface, if it had been one), but 
> that information can still be presented, via the return type (either 
> `static`, or `self` on the parent and the actual class name on each of the 
> implementations).
> 

I don't really understand this part. What kind of meta-programming are you 
thinking of? Can you give an example?
Calling the create methods on the abstract type does not seem sensible, because 
the input data is not interchangeable between the two subclasses.
E.g. passing an XML input into HTMLDocument will not go well (and vice versa).
So it seems to me that you need to know the type to be able to give sensible 
input.

> 
> Regarding the private constructor: I understand the issue with the *old* 
> class being confusing - but your new class doesn't have that issue, because 
> there are no "loadXXX" methods: as you said, if you're loading an existing 
> document, you're forced to do it via the static factory methods. With that 
> change alone, there's zero risk of confusion in allowing direct constructor 
> calls, because once the object is instantiated there is no subsequent way to 
> load a document and inadvertently change the encoding.
> 
> Having a regular constructor and then one or more factory methods for 
> specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
> `createFromXXX` as well as a constructor), and IMO needing to call a factory 
> method to get an "empty" document seems out of place in PHP - it seems a bit 
> like a Java-ism - using a factory, where one just isn't required.
> 

I agree with what Tim and Larry have said here.
I do agree too that it should be called createFromString and createFromFile to 
be consistent with the DateTime classes.
And perhaps instead of createFromEmpty it should be createEmpty because you're 
not creating an instance "from" something here, you're creating it with 
"nothing".

> 
> Cheers
> 
> Stephen
> 

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Larry Garfield
On Tue, Sep 19, 2023, at 7:30 AM, Tim Düsterhus wrote:
> Hi
>
> On 9/19/23 08:35, Stephen Reay wrote:
>> Regarding the private constructor: I understand the issue with the *old* 
>> class being confusing - but your new class doesn't have that issue, because 
>> there are no "loadXXX" methods: as you said, if you're loading an existing 
>> document, you're forced to do it via the static factory methods. With that 
>> change alone, there's zero risk of confusion in allowing direct constructor 
>> calls, because once the object is instantiated there is no subsequent way to 
>> load a document and inadvertently change the encoding.
>> 
>> Having a regular constructor and then one or more factory methods for 
>> specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
>> `createFromXXX` as well as a constructor), and IMO needing to call a factory 
>> method to get an "empty" document seems out of place in PHP - it seems a bit 
>> like a Java-ism - using a factory, where one just isn't required.
>> 
>
> I was one of the persons who discussed this updated API with Niels in 
> private and looking up the discussion it was me who proposed making the 
> constructor private and just providing named constructors.
>
>  From the perspective of the user of the API, I like the symmetry 
> between all the named constructors:
>
> Whenever I want to create a new document, I use one of the fromXyz() 
> methods. And when I use those, I get exactly what it says on the tin.
>
> Making the regular constructor available for use would effectively make 
> whatever that constructor does a special case / the default case. This 
> makes sense for DateTimeImmutable, because the __construct() variant is 
> likely much more often used than the various createFromXyz() variants. 
> For the HtmlDocument I find it less obvious that creating an empty 
> document would be the default case compared to loading an existing 
> document from a file.
>
> We should probably rename the named constructors to include the "create" 
> prefix for consistency with DTI though.
>
> Best regards
> Tim Düsterhus

I agree with Tim here.  If you have named constructors that are all "equally 
common," then it's non-obvious which one should be the "unnamed default."  I 
know that `fromX()` requires an X, `fromY()` requires a Y, but what does 
`__construct()` require?  I cannot tell from the call site.  If __construct() 
is just an alias of on of the fromX() methods, then what purpose does it serve 
other than confusion?

In addition, although it's probably not super relevant here, static factory 
methods are chainable in a way that `new` is not.  No extra parens are 
necessary.  For classes I'm creating at hoc (rather than services that are 
always managed via DI), I often prefer a named constructor just for convenience.

I don't see it as a Java-ism at all.  Java is (in)famous for excessive layers 
of factory classes, which is something different than what is going on here.  
(How fair that reputation is in 2023, I don't know.)  This is just "named 
constructors," which has been a reasonably common pattern in PHP land for a 
long time.

--Larry Garfield

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Stephen Reay



> On 19 Sep 2023, at 14:30, Tim Düsterhus  wrote:
> 
> Hi
> 
> On 9/19/23 08:35, Stephen Reay wrote:
>> Regarding the private constructor: I understand the issue with the *old* 
>> class being confusing - but your new class doesn't have that issue, because 
>> there are no "loadXXX" methods: as you said, if you're loading an existing 
>> document, you're forced to do it via the static factory methods. With that 
>> change alone, there's zero risk of confusion in allowing direct constructor 
>> calls, because once the object is instantiated there is no subsequent way to 
>> load a document and inadvertently change the encoding.
>> Having a regular constructor and then one or more factory methods for 
>> specific cases is already a concept in PHP (i.e. DateTime[Immutable]'s  
>> `createFromXXX` as well as a constructor), and IMO needing to call a factory 
>> method to get an "empty" document seems out of place in PHP - it seems a bit 
>> like a Java-ism - using a factory, where one just isn't required.
> 
> I was one of the persons who discussed this updated API with Niels in private 
> and looking up the discussion it was me who proposed making the constructor 
> private and just providing named constructors.
> 
> From the perspective of the user of the API, I like the symmetry between all 
> the named constructors:
> 
> Whenever I want to create a new document, I use one of the fromXyz() methods. 
> And when I use those, I get exactly what it says on the tin.
> 
> Making the regular constructor available for use would effectively make 
> whatever that constructor does a special case / the default case. This makes 
> sense for DateTimeImmutable, because the __construct() variant is likely much 
> more often used than the various createFromXyz() variants. For the 
> HtmlDocument I find it less obvious that creating an empty document would be 
> the default case compared to loading an existing document from a file.
> 
> We should probably rename the named constructors to include the "create" 
> prefix for consistency with DTI though.
> 
> Best regards
> Tim Düsterhus
> 

Hi Tim,


Thanks for providing context on where this idea came from.


The constructor + specialised factory methods pattern is not exactly new in PHP 
(e.g. it took about 3 minutes to find multiple Symphony classes doing this), 
and disabling the public constructor for purely cosmetic reasons sounds like a 
very weird, and ironic choice to me, when the stated goal is to make the API 
*less surprising* than the previous one.


Cheers

Stephen 
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Tim Düsterhus

Hi

On 9/19/23 08:35, Stephen Reay wrote:

Regarding the private constructor: I understand the issue with the *old* class being 
confusing - but your new class doesn't have that issue, because there are no 
"loadXXX" methods: as you said, if you're loading an existing document, you're 
forced to do it via the static factory methods. With that change alone, there's zero risk 
of confusion in allowing direct constructor calls, because once the object is 
instantiated there is no subsequent way to load a document and inadvertently change the 
encoding.

Having a regular constructor and then one or more factory methods for specific cases is 
already a concept in PHP (i.e. DateTime[Immutable]'s  `createFromXXX` as well as a 
constructor), and IMO needing to call a factory method to get an "empty" 
document seems out of place in PHP - it seems a bit like a Java-ism - using a factory, 
where one just isn't required.



I was one of the persons who discussed this updated API with Niels in 
private and looking up the discussion it was me who proposed making the 
constructor private and just providing named constructors.


From the perspective of the user of the API, I like the symmetry 
between all the named constructors:


Whenever I want to create a new document, I use one of the fromXyz() 
methods. And when I use those, I get exactly what it says on the tin.


Making the regular constructor available for use would effectively make 
whatever that constructor does a special case / the default case. This 
makes sense for DateTimeImmutable, because the __construct() variant is 
likely much more often used than the various createFromXyz() variants. 
For the HtmlDocument I find it less obvious that creating an empty 
document would be the default case compared to loading an existing 
document from a file.


We should probably rename the named constructors to include the "create" 
prefix for consistency with DTI though.


Best regards
Tim Düsterhus

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-19 Thread Stephen Reay


> On 19 Sep 2023, at 01:00, Niels Dossche 
> 
> Cheers
> Niels
> 
> -- 
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php



(Resending with history removed due to apparent size limit) 

Hi Niels,

Obviously, a method with different signatures (`fromEmptyDocument`) can't be in 
the parent, and I'll discuss that below, but having the other factory methods 
that do have the same signature in parent (even if only an abstract method, 
forcing concrete implementations to define it) means that it's possible to 
*safely* do meta-programming with these classes, where you load a string/file, 
using a classname from a variable. This use-case doesn't necessarily need to 
know which specific instance it's getting back, so long as it's an instance of 
the base class (or interface, if it had been one), but that information can 
still be presented, via the return type (either `static`, or `self` on the 
parent and the actual class name on each of the implementations).


Regarding the private constructor: I understand the issue with the *old* class 
being confusing - but your new class doesn't have that issue, because there are 
no "loadXXX" methods: as you said, if you're loading an existing document, 
you're forced to do it via the static factory methods. With that change alone, 
there's zero risk of confusion in allowing direct constructor calls, because 
once the object is instantiated there is no subsequent way to load a document 
and inadvertently change the encoding.

Having a regular constructor and then one or more factory methods for specific 
cases is already a concept in PHP (i.e. DateTime[Immutable]'s  `createFromXXX` 
as well as a constructor), and IMO needing to call a factory method to get an 
"empty" document seems out of place in PHP - it seems a bit like a Java-ism - 
using a factory, where one just isn't required.


Cheers

Stephen

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-18 Thread Niels Dossche
Hi Stephen

On 18/09/2023 08:46, Stephen Reay wrote:
> 
> 
>> On 17 Sep 2023, at 18:28, Niels Dossche  wrote:
>>
>> Hi Alexandru
>>
>> On 9/17/23 11:59, Alexandru Pătrănescu wrote:
>>> On Sat, Sep 16, 2023, 02:17 Niels Dossche  wrote:
>>>

 We'll add a common abstract base class DOM\Document (name taken from the
 DOM spec & Javascript world).
 DOM\Document contains the properties and abstract methods common to both
 HTML and XML documents.


>>> Hi,
>>>
>>> Yes looks a lot better.
>>> Great work overall! And thank you for taking on this effort.
>>>
>>> I would have a small suggestion: to make the abstract class an interface.
>>> This will allow even more flexibility on how things can be build further,
>>> suggesting composition over inheritance.
>>> In user land we cannot have interfaces with properties (yet) but in php
>>> internal interfaces we have example with interface UnitEnum that has name
>>> property, extendes by BackedEnum that adds value property.
>>>
>>
>> Right, we discussed the use of an interface internally too.
>> Indeed as you suggest, we chose an abstract class over an interface because 
>> of the property limitation.
>> Looking at UnitEnum & BackedEnum 
>> (https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php
>>  
>> )
>>  I don't see the properties defined on the interface. In practice, all enums 
>> get the properties of course, just not via the interface.
>> So as we cannot represent the properties on the interfaces (yet), we'll 
>> stick with the abstract class for now.
>>
>>
>>> Thank you,
>>> Alex
>>>
>>
>> Kind regards
>> Niels
>>
>> -- 
>> PHP Internals - PHP Runtime Development Mailing List
>> To unsubscribe, visit: https://www.php.net/unsub.php 
>> 
> Hi Niels,
> 
> Can you expand on the reasoning for two of the decisions in your proposal? 
> I'm not sure I really see the reason/benefit :
> 
> 1. fromX() methods are on the individual classes, rather than the parent, 
> which as I understand it, you're using as a poor-mans interface with 
> properties. I'd have thought that at the very least the parent should declare 
> those methods as abstract
> 

At least the fromEmptyDocument signature differs between HTMLDocument & 
XMLDocument, so it's not possible to declare that method on the parent.
It's possible to define the other two on the parent. However, I choose against 
that because:
1) They have different behaviour if called on XML vs HTML, because they return 
their respective new instance. At least the other methods have the same 
behaviour between XML & HTML. Furthermore it would be weird to have 2 out of 3 
factory methods on the parent, but the other one not.
2) It makes some sense at least to call e.g. createComment() while not knowing 
the concrete type of the document (e.g. when supporting both XML & HTML 
documents), so having that method in the parent makes sense. For example:
function addCopyright(DOM\Document $doc) {
  $doc->append($doc->createComment("(c) foo 2023"));
}
However, you're not going to call the factory methods on a variable that can be 
either XML or HTML document, as you wouldn't know a priori what you're getting 
back.

> 
> 2. Why "fromEmptyDocument()" rather than just allowing `new 
> DOM\{XML,HTML}Document` via a public constructor?

While technically possible, I find it confusing to have both a normal 
constructor and factory methods.
When people see factory methods, they'll know - from experience - that every 
construction needs to go via factories.
Otherwise, people are going to call the constructor and search for a 
loadHTML/loadHTMLFile method like they do with DOMDocument.

> 
> The only mention of 'constructor' I see in the email or the RFC is the line 
> below:
> 
>> the properties set by DOMDocument's constructor are overridden by its load 
>> methods, which is surprising
> 
> But I don't really see how making the constructor private and forcing use of 
> a static method changes this aspect, at all - it just introduces a different 
> way to create an instance.
> 

The problem with DOMDocument's approach is the following:
$dom = new DOMDocument(encoding: "bla");
$dom->loadXML(...); // Oops encoding gone!

With the new classes, you load the document at construction time (or no 
document at all).
Therefore, this confusion/API misuse is prevented by the API design.
You can see it as having 3 different constructors that disallow loading another 
document on the same instance afterwards.

> 
> Otherwise, it's great to see some activity on DOM handling classes.
> 
> 
> Cheers
> 
> 
> Stephen 
> 

Cheers
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-18 Thread Stephen Reay


> On 17 Sep 2023, at 18:28, Niels Dossche  wrote:
> 
> Hi Alexandru
> 
> On 9/17/23 11:59, Alexandru Pătrănescu wrote:
>> On Sat, Sep 16, 2023, 02:17 Niels Dossche  wrote:
>> 
>>> 
>>> We'll add a common abstract base class DOM\Document (name taken from the
>>> DOM spec & Javascript world).
>>> DOM\Document contains the properties and abstract methods common to both
>>> HTML and XML documents.
>>> 
>>> 
>> Hi,
>> 
>> Yes looks a lot better.
>> Great work overall! And thank you for taking on this effort.
>> 
>> I would have a small suggestion: to make the abstract class an interface.
>> This will allow even more flexibility on how things can be build further,
>> suggesting composition over inheritance.
>> In user land we cannot have interfaces with properties (yet) but in php
>> internal interfaces we have example with interface UnitEnum that has name
>> property, extendes by BackedEnum that adds value property.
>> 
> 
> Right, we discussed the use of an interface internally too.
> Indeed as you suggest, we chose an abstract class over an interface because 
> of the property limitation.
> Looking at UnitEnum & BackedEnum 
> (https://github.com/php/php-src/blob/bae30682b896b26f177f83648bd58c77ba3480a8/Zend/zend_enum.stub.php)
>  I don't see the properties defined on the interface. In practice, all enums 
> get the properties of course, just not via the interface.
> So as we cannot represent the properties on the interfaces (yet), we'll stick 
> with the abstract class for now.
> 
> 
>> Thank you,
>> Alex
>> 
> 
> Kind regards
> Niels
> 
> -- 
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
Hi Niels,

Can you expand on the reasoning for two of the decisions in your proposal? I'm 
not sure I really see the reason/benefit :

1. fromX() methods are on the individual classes, rather than the parent, which 
as I understand it, you're using as a poor-mans interface with properties. I'd 
have thought that at the very least the parent should declare those methods as 
abstract


2. Why "fromEmptyDocument()" rather than just allowing `new 
DOM\{XML,HTML}Document` via a public constructor?

The only mention of 'constructor' I see in the email or the RFC is the line 
below:

> the properties set by DOMDocument's constructor are overridden by its load 
> methods, which is surprising

But I don't really see how making the constructor private and forcing use of a 
static method changes this aspect, at all - it just introduces a different way 
to create an instance.


Otherwise, it's great to see some activity on DOM handling classes.


Cheers


Stephen 



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-04 Thread Niels Dossche
Hey Christian

Thank you for going through my proposal.

On 04/09/2023 09:23, naitsi...@e.mail.de wrote:
> Am 02-Sep-2023 21:41:50 +0200 schrieb dossche.ni...@gmail.com:
>> Hello internals
>>
>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
>> support".
>> https://wiki.php.net/rfc/domdocument_html5_parser
>>
>> Kind regards
>> Niels
>>
>> -- 
>> PHP Internals - PHP Runtime Development Mailing List
>> To unsubscribe, visit: https://www.php.net/unsub.php
>>
> 
> Hi Niels,
> 
> thank you for your proposal and your work on this.
> 
>> This proposal introduces the DOM\HTML5Document class that extends the 
>> DOMDocument class. The reason we introduce a new class instead of replacing 
>> the methods of the existing class is to ensure full backwards compatibility.
> 
> Although I do not dislike your idea with a new class for HTML5 parsing I have 
> one question. Why not make the decision, which parser to use, dependent from 
> the doctype declaration at the start of the html document?
There's three major reasons to not depend solely on the doctype:
1) Not all documents have a doctype. So in that case, do we have to pick the 
old parser or the HTML5 parser?
2) People create DOM documents from scratch, without loading some code or file 
first.
   In that case, we can't choose upfront if the document is an HTML5 document 
or a legacy document.
   I believe that by using a class, and thus having a clear choice upfront, 
this minimizes surprises.
3) Having a separate class allows us to use the type system to restrict some 
users to only allow HTML5 documents.

> 
> Best regards
> Christian

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-04 Thread naitsirch
Am 02-Sep-2023 21:41:50 +0200 schrieb dossche.ni...@gmail.com:
> Hello internals
> 
> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
> support".
> https://wiki.php.net/rfc/domdocument_html5_parser
> 
> Kind regards
> Niels
> 
> -- 
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
> 

Hi Niels,

thank you for your proposal and your work on this.

> This proposal introduces the DOM\HTML5Document class that extends the 
> DOMDocument class. The reason we introduce a new class instead of replacing 
> the methods of the existing class is to ensure full backwards compatibility.

Although I do not dislike your idea with a new class for HTML5 parsing I have 
one question. Why not make the decision, which parser to use, dependent from 
the doctype declaration at the start of the html document?

Best regards
Christian

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



[PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

2023-09-02 Thread Niels Dossche
Hello internals

I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
support".
https://wiki.php.net/rfc/domdocument_html5_parser

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php