Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-16 Thread a.h.s. boy (lists)
I grabbed a copy from CVS, but I'm in the middle of a few days of  
hardcore iCalendar coding, so I'm focusing on that. I'll run some  
tests and offer comments as soon as I have the chance. Thanks for the  
quick work!


Cheers,
spud.

On Nov 16, 2005, at 11:33 AM, Gaetano Giunta wrote:

OK, code checked into CVS. Feel free to download and test it (I
added a new test case for UTF-8 in the testsuite, but the more testing
the better).


I adopted the 'convert all to ASCII' way-of-life, and modified the  
function xmlrpc_encode_entities() to respect the value of $GLOBALS 
['xmlrpc_internalencoding'].


As stated in my last post, more flexible usage patterns might make  
it into future releases.


Right now escaping iso-8859-1 might be faster than it was
previously, since I use str_replace instead of the hand-made
algorithm, but escaping UTF8 will be dog slow.
The lib is not built for speed anyway; if you're aiming for that,
the php xmlrpc extension will surely serve you better.


The main problems I see with that right now are:

- turning xmlrpc_encode_entities() into a general charset
transcoder might make it slower for the default case operation,
unless the user has mbstring ON


- how server and msg objs will communicate to xmlrpcval objs the  
desired charset for serialization (only solution I can think of:  
add an extra param in calls to serialize())


- xmlrpc_encode_entities() is used when serializing server-added  
debug info. Since that info might come at the same time from user  
messages, client request (at debug lvl 3) and php error messages,  
there is a serious risk it will be a charset pot-pourri, i.e. there
is no sure way that it will conform to ANY charset.
I wonder if using a CDATA section instead of a comment to wrap
debug info might help in solving this problem.
The second solution is to just base64-encode the debug info, and
let the client sort it out.
Of course that would break any existing client that makes use of
that undocumented info...


Bye
Gaetano


-Original Message-
From: a.h.s. boy (lists) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 15, 2005 6:57 PM
To: Gaetano Giunta
Cc: phpxmlrpc@lists.usefulinc.com
Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error


On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:


Very toughtful response.


Man, I love cross-linguistic typos...makes great new English words:
"toughtful" = "tough thoughtfulness". Brilliant.


UTF-8 everywhere is fine and dandy but for 2 aspects:

- in fact XML-over-http without a charset declaration SHOULD be
assumed to be ISO-8859-1 (there is a RFC somewhere about that,
which I cannot recall now).


Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)
reads:

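Because each XML entity not accompanied by external encoding
information and not in UTF-8 or UTF-16 encoding MUST begin with an
XML encoding declaration, in which the first characters must be
'<?xml', any conforming processor can detect, after two to four octets
of input, which of the following cases apply.

RFC 2376, however, offers suggestions for XML MIME-types sent over
HTTP, but it reads (pardon the length):

   Although listed as an optional parameter, the use of the charset
   parameter is STRONGLY RECOMMENDED, since this information can be
   used by XML processors to determine authoritatively the character
   encoding of the XML entity. The charset parameter can also be used
   to provide protocol-specific operations, such as charset-based
   content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
   recommended value, representing the UTF-8 charset. UTF-8 is
   supported by all conforming XML processors [REC-XML].

   If the XML entity is transmitted via HTTP, which uses a MIME-like
   mechanism that is exempt from the restrictions on the text top-level
   type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
   (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
   recommended.  UTF-16 is supported by all conforming XML processors
   [REC-XML].  Since the handling of CR, LF and NUL for text types in
   most MIME applications would cause undesired transformations of
   individual octets in UTF-16 multi-octet characters, gateways from
   HTTP to these MIME applications MUST transform the XML entity from
   a text/xml; charset="utf-16" to application/xml; charset="utf-16".

   Conformant with [RFC-2046], if a text/xml entity is received with
   the charset parameter omitted, MIME processors and XML processors
   MUST use the default charset value of "us-ascii".  In cases where
   the XML entity is transmitted via HTTP, the default charset value
   is still "us-ascii".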

...which implies that us-ascii, not iso-8859-1, is the default (but
not really a problem if you're encoding everything outside of
ASCII).
But I know that my RDFParser class, for example, defaults to "utf-8"
and overrides that only if the encoding is specified as something
else in the xml declaration. I assume I made that decision for good
reasons, though I don't remember them now!

Still, the factors affecting encoding and transmission are
unbelievably complex. In my software, for example, there are:

1) Page encoding used when users submit data via a form (mine: UTF-8)
   a) Default charset header sent by Apache (mine: UTF-8)
   b) Default charset set in META tags (mine: UTF-8)
   c) Charset setting of client browser (no control!)
2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
3) Encoding of page used to display data (Irrelevant to XML-RPC
transfers, but 1a,1b,1c apply)
4) PHP 

RE: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-16 Thread Gaetano Giunta
OK, code checked into CVS. Feel free to download and test it (I added a new
test case for UTF-8 in the testsuite, but the more testing the better).

I adopted the 'convert all to ASCII' way-of-life, and modified the function 
xmlrpc_encode_entities() to respect the value of 
$GLOBALS['xmlrpc_internalencoding'].

As stated in my last post, more flexible usage patterns might make it into 
future releases.

Right now escaping iso-8859-1 might be faster than it was previously, since I
use str_replace instead of the hand-made algorithm, but escaping UTF8 will be
dog slow.
The lib is not built for speed anyway; if you're aiming for that, the php xmlrpc
extension will surely serve you better.
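
To make the approach above concrete, here is a minimal sketch of charset-aware
escaping to 7-bit ASCII. It is not the code checked into CVS: the names
sketch_encode_entities() and sketch_utf8_codepoint() are hypothetical, and only
ISO-8859-1 and UTF-8 input are handled.

```php
<?php
// Hypothetical helper: decode ONE UTF-8 encoded character to its code point.
function sketch_utf8_codepoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Leading byte: mask off the bits that only mark the sequence length.
    $code = $bytes[0] & (0xFF >> ($n + 1));
    // Continuation bytes each carry 6 payload bits.
    for ($i = 1; $i < $n; $i++) {
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }
    return $code;
}

// Hypothetical helper: escape $data so the result is pure US-ASCII XML text.
// $internalEncoding plays the role of $GLOBALS['xmlrpc_internalencoding'].
function sketch_encode_entities($data, $internalEncoding = 'ISO-8859-1')
{
    // Markup characters must be escaped whatever the charset is.
    $data = str_replace(
        array('&', '<', '>', '"', "'"),
        array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;'),
        $data
    );

    $encoding = strtoupper($internalEncoding);
    if ($encoding === 'ISO-8859-1') {
        // One byte is one character, so a fixed table and a single
        // str_replace call are enough.
        $search = array();
        $replace = array();
        for ($code = 128; $code <= 255; $code++) {
            $search[] = chr($code);
            $replace[] = '&#' . $code . ';';
        }
        return str_replace($search, $replace, $data);
    }
    if ($encoding === 'UTF-8') {
        // Each non-ASCII character is a multi-byte sequence that has to be
        // decoded first. Returns null if $data is not valid UTF-8.
        return preg_replace_callback(
            '/[^\x00-\x7F]/u',
            function ($m) {
                return '&#' . sketch_utf8_codepoint($m[0]) . ';';
            },
            $data
        );
    }
    // US-ASCII (or an unknown charset): nothing more to do in this sketch.
    return $data;
}
```

In this sketch the UTF-8 path runs a callback per non-ASCII character, which is
the kind of work that would make it slower than the single str_replace call
used for ISO-8859-1.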

The main problems I see with that right now are:

- turning xmlrpc_encode_entities() into a general charset transcoder might make
it slower for the default case operation, unless the user has mbstring ON

- how server and msg objs will communicate to xmlrpcval objs the desired 
charset for serialization (only solution I can think of: add an extra param in 
calls to serialize())

- xmlrpc_encode_entities() is used when serializing server-added debug info.
Since that info might come at the same time from user messages, client request
(at debug lvl 3) and php error messages, there is a serious risk it will be a
charset pot-pourri, i.e. there is no sure way that it will conform to ANY charset.
I wonder if using a CDATA section instead of a comment to wrap debug info might
help in solving this problem.
The second solution is to just base64-encode the debug info, and let the client
sort it out.
Of course that would break any existing client that makes use of that
undocumented info...
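
The base64 option from the last bullet could look roughly like the sketch
below. The function names and the "DEBUG INFO (base64)" marker are invented for
illustration and are not the format the library uses. The point is only that
base64 output is plain ASCII and never contains "--", so it is safe inside an
XML comment whatever mix of charsets went in.

```php
<?php
// Hypothetical: wrap arbitrary debug bytes in an XML comment.
function sketch_wrap_debug_info($debug)
{
    // base64 uses only A-Z, a-z, 0-9, '+', '/' and '=', so the payload is
    // valid in any ASCII-compatible document encoding.
    return "<!-- DEBUG INFO (base64):\n" . base64_encode($debug) . "\n-->";
}

// Hypothetical: what a client would do to get the original bytes back.
function sketch_unwrap_debug_info($xml)
{
    $pattern = '/<!-- DEBUG INFO \(base64\):\n([A-Za-z0-9+\/=]*)\n-->/';
    if (preg_match($pattern, $xml, $m)) {
        return base64_decode($m[1]);
    }
    return null; // no debug comment found
}
```

The client gets back exactly the bytes the server had, and still has to guess
which charset each piece was in; the sketch only keeps the response parseable.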

Bye
Gaetano

> -Original Message-
> From: a.h.s. boy (lists) [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 15, 2005 6:57 PM
> To: Gaetano Giunta
> Cc: phpxmlrpc@lists.usefulinc.com
> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
> 
> 
> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
> 
> > Very toughtful response.
> 
> Man, I love cross-linguistic typos...makes great new English words:  
> "toughtful" = "tough thoughtfulness". Brilliant.
> 
> > UTF-8 everywhere is fine and dandy but for 2 aspects:
> >
> > - in fact XML-over-http without a charset declaration SHOULD be  
> > assumed to be ISO-8859-1 (there is a RFC somewhere about that,  
> > which I cannot recall now).
> 
> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)  
> reads:
> 
> Because each XML entity not accompanied by external encoding
> information and not in UTF-8 or UTF-16 encoding MUST begin with an
> XML encoding declaration, in which the first characters must be
> '<?xml', any conforming processor can detect, after two to four octets
> of input, which of the following cases apply.
>
> RFC 2376, however, offers suggestions for XML MIME-types sent over
> HTTP, but it reads (pardon the length):
>
>    Although listed as an optional parameter, the use of the charset
>    parameter is STRONGLY RECOMMENDED, since this information can be
>    used by XML processors to determine authoritatively the character
>    encoding of the XML entity. The charset parameter can also be used
>    to provide protocol-specific operations, such as charset-based
>    content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
>    recommended value, representing the UTF-8 charset. UTF-8 is
>    supported by all conforming XML processors [REC-XML].
>
>    If the XML entity is transmitted via HTTP, which uses a MIME-like
>    mechanism that is exempt from the restrictions on the text top-level
>    type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
>    (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
>    recommended.  UTF-16 is supported by all conforming XML processors
>    [REC-XML].  Since the handling of CR, LF and NUL for text types in
>    most MIME applications would cause undesired transformations of
>    individual octets in UTF-16 multi-octet characters, gateways from
>    HTTP to these MIME applications MUST transform the XML entity from
>    a text/xml; charset="utf-16" to application/xml; charset="utf-16".
>
>    Conformant with [RFC-2046], if a text/xml entity is received with
>    the charset parameter omitted, MIME processors and XML processors
>    MUST use the default charset value of "us-ascii".  In cases where
>    the XML entity is transmitted via HTTP, the default charset value
>    is still "us-ascii".
> 

RE: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-16 Thread Gaetano Giunta
Darn, just when I thought I had reached charset-encoding guru state, I discover 
I was mostly wrong.
I really love to be a coder...

> ...
> On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:
> 
> > Very toughtful response.
> 
> Man, I love cross-linguistic typos...makes great new English words:  
> "toughtful" = "tough thoughtfulness". Brilliant.

I can do a lot better if you wish, mixing up Italian, French, English and PHP
typos all in the same sentence ;)

> > UTF-8 everywhere is fine and dandy but for 2 aspects:
> >
> > - in fact XML-over-http without a charset declaration SHOULD be  
> > assumed to be ISO-8859-1 (there is a RFC somewhere about that,  
> > which I cannot recall now).
> 
> Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)  
> reads:
> 
> ...
> 
> RFC 2376, however, offers suggestions for XML MIME-types sent over  
> HTTP, but it reads (pardon the length):
> 
> ...

OK, I'll admit I blew this one.
I cannot figure out which RFC I (mis)read that convinced me that latin-1 was
the way to go for text/xml over http, but RFC 3023 is definitely THE reference
on this subject. And it states that
- a charset encoding SHOULD be put in the http headers for interop's sake
- when that is unavailable, xml MUST be treated as US-ASCII (regardless of the
xml prologue...)
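
As a rough illustration of the two rules above, a receiver of text/xml could
pick the charset like this. The function name is hypothetical and the header
parsing is deliberately simplistic; it is a sketch, not something taken from
the lib.

```php
<?php
// Hypothetical: decide which charset applies to a text/xml body, given the
// raw value of the HTTP Content-Type header.
function sketch_textxml_charset($contentType)
{
    // Rule 1: an explicit charset parameter in the header wins.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Rule 2: no charset parameter means US-ASCII for text/xml, whatever
    // the XML prologue says.
    return 'US-ASCII';
}
```

For example, sketch_textxml_charset('text/xml') gives 'US-ASCII', which is why
entity-encoding everything outside ASCII stays safe when no charset is sent.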

> ...
> But I know that my RDFParser class, for example, defaults to "utf-8"  
> and overrides that only if the encoding is specified as something  
> else in the xml declaration. I assume I made that decision for good  
> reasons, though I don't remember them now!

Most likely having bad sources of xml that send utf-8 stuff without declaring 
it explicitly. Very annoying, but quite common, at least a little while ago.

> 
> Still, the factors affecting encoding and transmission are
> unbelievably complex.
> ...
> and...ugh! Sometimes I just want to kill myself.

Yup, I only had the chance to prove myself with an Arabic website once. It was
great fun, and a source of a lot of learning, but it never went online (and the
translator refused to translate single phrases as I had specced, to be put in
the translation engine db, but insisted on giving me back the 5 page translation
document without hinting at any separation of paragraphs...)

> 
> While I suppose that attempting to convert all data into us-ascii  
> through entity encoding gives us the "least common denominator"  
> solution -- make everything 7-bit! -- it obviously isn't working  
> perfectly.

This is btw a 'road accident', not a by-design feature, and the previous
situation was wrong anyway.
The general solution (i.e. let the lib encode any internal charset to ascii) is
a bit daunting to code in php, but adding the 80% case (i.e. utf8 to ascii) is,
I think, quite easy. AND we are following the spec.

> So perhaps any solution that simply makes it work,
> regardless of whether or not it changes the use of
> $xmlrpc_internalencoding, would be good. I did wonder about the
> utf8_encode() function, and why you didn't simply use that instead of
> $character = ("&#".strval($code).";"); Won't that do all the right
> work for you?

Yes, provided that we added UTF-8 in the http headers.
No, in the current situation.

> 
> In any case, I think you should try to make the XMLRPC library follow
> as closely as possible the relevant spec/RFC "recommended" behavior,
> and let that be your guide.
> ...

What I am currently thinking about is something along these lines:

1 - add support for xmlrpc_internalencoding in xmlrpc_encode_entities(), ONLY 
for utf-8 to ascii, ascii-to-ascii and iso-8859-1 to ascii

2 - add support for specific charset encodings into xmlrpcmsg. If left 
unspecified, defaults to us-ascii, as per the current behaviour. When 
specified, it will modify the http content-type header, and potentially save a 
lot of time while NOT encoding special chars into xml entities

3 - figure out whether the choice of response charset encoding should be left to
the response object or to the server. Hint: the server can make intelligent
decisions based on the client's http headers (Accept-Charset).
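
For point 3, a server-side decision based on the client's Accept-Charset
header might be sketched as follows. The function name, the list of supported
charsets and the fallback to US-ASCII are assumptions made for the example;
wildcards ("*") are ignored to keep it short.

```php
<?php
// Hypothetical: pick the response charset from an Accept-Charset header value.
function sketch_pick_response_charset(
    $acceptCharset,
    $supported = array('UTF-8', 'ISO-8859-1', 'US-ASCII')
) {
    $best = 'US-ASCII'; // safe default: everything gets entity-encoded
    $bestQ = 0.0;
    foreach (explode(',', $acceptCharset) as $item) {
        $parts = explode(';', trim($item));
        $name = strtoupper(trim($parts[0]));
        // A missing q parameter means a quality of 1.
        $q = 1.0;
        if (isset($parts[1]) && preg_match('/^\s*q\s*=\s*([0-9.]+)/i', $parts[1], $m)) {
            $q = (float) $m[1];
        }
        // Keep the supported charset with the highest quality value.
        if (in_array($name, $supported, true) && $q > $bestQ) {
            $best = $name;
            $bestQ = $q;
        }
    }
    return $best;
}
```

A client that sends no Accept-Charset header at all simply gets the US-ASCII
default, i.e. the current behaviour.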


Bye
Gaetano
___
phpxmlrpc mailing list
phpxmlrpc@lists.usefulinc.com
http://lists.usefulinc.com/cgi-bin/mailman/listinfo/phpxmlrpc


Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-15 Thread a.h.s. boy (lists)

On Nov 15, 2005, at 11:31 AM, Gaetano Giunta wrote:


Very toughtful response.


Man, I love cross-linguistic typos...makes great new English words:  
"toughtful" = "tough thoughtfulness". Brilliant.



UTF-8 everywhere is fine and dandy but for 2 aspects:

- in fact XML-over-http without a charset declaration SHOULD be  
assumed to be ISO-8859-1 (there is a RFC somewhere about that,  
which I cannot recall now).


Hmmm. The XML 1.0 spec (http://www.w3.org/TR/2000/REC-xml-20001006)  
reads:


Because each XML entity not accompanied by external encoding  
information and not in UTF-8 or UTF-16 encoding MUST begin with an  
XML encoding declaration, in which the first characters must be
'<?xml', any conforming processor can detect, after two to four octets
of input, which of the following cases apply.


RFC 2376, however, offers suggestions for XML MIME-types sent over  
HTTP, but it reads (pardon the length):


   Although listed as an optional parameter, the use of the charset
   parameter is STRONGLY RECOMMENDED, since this information can be
   used by XML processors to determine authoritatively the character
   encoding of the XML entity. The charset parameter can also be used
   to provide protocol-specific operations, such as charset-based
   content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
   recommended value, representing the UTF-8 charset. UTF-8 is
   supported by all conforming XML processors [REC-XML].

   If the XML entity is transmitted via HTTP, which uses a MIME-like
   mechanism that is exempt from the restrictions on the text top-level
   type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
   (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
   recommended.  UTF-16 is supported by all conforming XML processors
   [REC-XML].  Since the handling of CR, LF and NUL for text types in
   most MIME applications would cause undesired transformations of
   individual octets in UTF-16 multi-octet characters, gateways from
   HTTP to these MIME applications MUST transform the XML entity from
   a text/xml; charset="utf-16" to application/xml; charset="utf-16".

   Conformant with [RFC-2046], if a text/xml entity is received with
   the charset parameter omitted, MIME processors and XML processors
   MUST use the default charset value of "us-ascii".  In cases where
   the XML entity is transmitted via HTTP, the default charset value
   is still "us-ascii".

...which implies that us-ascii, not iso-8859-1, is the default (but  
not really a problem if you're encoding everything outside of ASCII).  
But I know that my RDFParser class, for example, defaults to "utf-8"  
and overrides that only if the encoding is specified as something  
else in the xml declaration. I assume I made that decision for good  
reasons, though I don't remember them now!


Still, the factors affecting encoding and transmission are
unbelievably complex. In my software, for example, there are:


1) Page encoding used when users submit data via a form (mine: UTF-8)
   a) Default charset header sent by Apache (mine:  UTF-8)
   b) Default charset set in META tags (mine: UTF-8)
   c) Charset setting of client browser (no control!)
2) Encoding of database (mine: MySQL 3.x, so limited to ISO-8859-1)
3) Encoding of page used to display data (Irrelevant to XML-RPC  
transfers, but 1a,1b,1c apply)

4) PHP internal encoding
5) XMLRPC library internal encoding
6) XML declaration charset (optional, but highly recommended by spec)
7) text/xml MIME type charset declaration (optional, mine: text/ 
xml;charset=utf-8)

8) application/xml MIME type charset declaration (optional)

...and since all of them could be set to different encodings, getting  
it all straight is a dizzying adventure. Add to that the complexity  
of handling things like users copying text from a Word document  
created in Windows-1252 and pasting into a form on a UTF-8 page,  
and...ugh! Sometimes I just want to kill myself.


While I suppose that attempting to convert all data into us-ascii  
through entity encoding gives us the "least common denominator"  
solution -- make everything 7-bit! -- it obviously isn't working  
perfectly. So perhaps any solution that simply makes it work,  
regardless of whether or not it changes the use of  
$xmlrpc_internalencoding, would be good. I did wonder about the  
utf8_encode() function, and why you didn't simply use that instead of  
$character = ("&#".strval($code).";"); Won't that do all the right  
work for you?


In any case, I think you should try to make the XMLRPC library follow  
as closely as possible the relevant spec/RFC "recommended" behavior,  
and let that be your guide.


Adding some extra settings to client/server objects is fine, but
the casual user might not be used to using those, and backward
compatibility is a primary concern to me.
Translated into code, that would probably mean adding some hacky 

RE: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-15 Thread Gaetano Giunta
Very toughtful response.

UTF-8 everywhere is fine and dandy but for 2 aspects:

- in fact XML-over-http without a charset declaration SHOULD be assumed to be
ISO-8859-1 (there is an RFC somewhere about that, which I cannot recall now).
The xmlrpc lib got it wrong the first time around, but I never dared to change
the global var to a more 'correct' default, as the only benefit I imagine would
have been breaking a lot of people's scripts.
This basically contradicts the argument that 'UTF-8 is universal': xmlrpc clients
written in other languages might (correctly) make the assumption that the
received xml charset is iso-8859-1 when unspecified, and dutifully choke on
utf-8 characters.

- unless mbstring is enabled, all PHP processing is carried out in ISO-8859-1
(of course, this does not apply to data fetched from your DB directly in UTF-8
encoding)

Having said that, there is no guarantee that strings that the user gets out of
his db are in fact utf-8, and sending some weird Japanese charset using a
utf-8 declaration is most likely wrong.

Adding some extra settings to client/server objects is fine, but the casual
user might not be used to using those, and backward compatibility is a primary
concern to me.
Translated into code, that would probably mean adding some hacky stuff of the sort
"object default charset preference is undefined, and while still undefined use
global variable, otherwise use object preference" (doable but ugly).
The tough part is letting the client object communicate the desired charset
encoding to the xmlrpcval object, since the responsibility of creating
serialized content is left to the xmlrpcval object itself (and I'm surely not
changing that fundamental assumption).
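
The "undefined means use the global" fallback described above could be
sketched like this. SketchClient, $charsetPreference and effectiveCharset()
are made-up names for illustration only; the lib's real client class is not
shown here.

```php
<?php
// Library-wide default, as the lib keeps it in a global variable.
$GLOBALS['xmlrpc_internalencoding'] = 'ISO-8859-1';

// Hypothetical stand-in for a client (or server) object.
class SketchClient
{
    // null means "no preference set on this object".
    public $charsetPreference = null;

    public function effectiveCharset()
    {
        // An explicit per-object preference wins...
        if ($this->charsetPreference !== null) {
            return $this->charsetPreference;
        }
        // ...otherwise old scripts keep working off the global variable.
        return $GLOBALS['xmlrpc_internalencoding'];
    }
}
```

Scripts that never touch the new property behave exactly as before, which is
the backward-compatibility property the paragraph asks for. How the value
would then reach the xmlrpcval object is left open here, as it is in the text.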

I think I need a couple of days to sort out a good solution...

Bye
Gaetano

ps: the real (only ?) advantage of using variables instead of constants for
things such as internal_encoding is that you can redefine them not inside the
xmlrpc lib but just after its inclusion;
this way you do not have to change anything when updating...
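
A minimal sketch of the override described in the postscript, assuming the
library file is called xmlrpc.inc and sits next to the calling script (both
are assumptions; the is_readable() guard is only there so the sketch runs
standalone):

```php
<?php
// Assumed file name and location; adjust to wherever the library lives.
$lib = 'xmlrpc.inc';
if (is_readable($lib)) {
    include_once $lib;
}

// Redefine the default AFTER the inclusion, in your own script, so the
// library file itself stays untouched across upgrades.
$GLOBALS['xmlrpc_internalencoding'] = 'UTF-8';
```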

> -Original Message-
> From: a.h.s. boy (lists) [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 15, 2005 4:34 PM
> To: Gaetano Giunta
> Cc: phpxmlrpc@lists.usefulinc.com
> Subject: Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
> 
> 
> On Nov 15, 2005, at 4:11 AM, Gaetano Giunta wrote:
> 
> > Brief analysis:
> >
> > - the lib tries to encode all chars outside of the ASCII range as  
> > 'XML character entity' when serializing
> 
> I understand the theory, but one of the benefits to using UTF-8 in  
> the first place is its ability to properly render all sorts of  
> languages and character sets. Debugging becomes brutal when you're  
> staring at a huge string of HTML entities.
> 
> > - this has the main benefit that such an xml is valid regardless of
> > the charset assumed by the parser, i.e. we do not need to add a
> > 'charset' parameter to either the HTTP Content-type header or the
> > XML prologue
>
> Well...apparently it isn't valid XML despite the lack of charset...or
> we wouldn't be having this discussion! ;-)
>
> > - it is also the best solution I could come up with to solve the
> > long-standing problems with charset encodings (I also tried the
> > other way round, e.g. explicitly stating the charset used for xml,
> > in a private fork of the lib I use for personal projects, but I
> > would rather stick with the current approach, as it solves the
> > problem in a more elegant way)
> 
> Believe me, I totally understand the issue of long-standing charset  
> encoding problems! I've been developing a CMS that needs to handle  
> multiple languages, alphabets, directionality, and XML-RPC/RSS feeds  
> all on the same page! Not easy, especially if your own linguistic  
> range is limited to English and Romance languages!
> 
> But I'm also a fan of proper declarations...and I'd rather have an  
> XML feed explicitly declare its charset encoding (and work) than try  
> to be "universal" and fail. :-)
> 
> I'll admit to not being fully familiar with all the XMLRPC library  
> code -- only enough to debug a bit -- but it appears that  
> $xmlrpc_internalencoding is declared as a global variable, though it  
> is only used in object methods. Could it be changed to be a property  
> of the xmlrpcmsg and xmlrpc_server classes? That way it could be set  
> through scripting with
> 
> $xmlrpcmsg->set_internalencoding($foo);
> 
> or something similar? That would be more flexible, and since you  
> _always_ know what the encoding is, you can send it in the XML  
> prologue, which is what that parameter is desig

Re: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-15 Thread a.h.s. boy (lists)

On Nov 15, 2005, at 4:11 AM, Gaetano Giunta wrote:


Brief analysis:

- the lib tries to encode all chars outside of the ASCII range as  
'XML character entity' when serializing


I understand the theory, but one of the benefits to using UTF-8 in  
the first place is its ability to properly render all sorts of  
languages and character sets. Debugging becomes brutal when you're  
staring at a huge string of HTML entities.


- this has the main benefit that such an xml is valid regardless of  
the charset assumed by the parser, i.e. we do not need to add a  
'charset' parameter to either the HTTP Content-type header or the  
XML prologue


Well...apparently it isn't valid XML despite the lack of charset...or  
we wouldn't be having this discussion! ;-)


- it is also the best solution I could come up with to solve the  
long-standing problems with charset encodings (I also tried the  
other way round, e.g. explicitly stating the charset used for xml,  
in a private fork of the lib I use for personal projects, but I  
would rather stick with the current approach, as it solves the  
problem in a more elegant way)


Believe me, I totally understand the issue of long-standing charset  
encoding problems! I've been developing a CMS that needs to handle  
multiple languages, alphabets, directionality, and XML-RPC/RSS feeds  
all on the same page! Not easy, especially if your own linguistic  
range is limited to English and Romance languages!


But I'm also a fan of proper declarations...and I'd rather have an  
XML feed explicitly declare its charset encoding (and work) than try  
to be "universal" and fail. :-)


I'll admit to not being fully familiar with all the XMLRPC library  
code -- only enough to debug a bit -- but it appears that  
$xmlrpc_internalencoding is declared as a global variable, though it  
is only used in object methods. Could it be changed to be a property  
of the xmlrpcmsg and xmlrpc_server classes? That way it could be set  
through scripting with


$xmlrpcmsg->set_internalencoding($foo);

or something similar? That would be more flexible, and since you  
_always_ know what the encoding is, you can send it in the XML  
prologue, which is what that parameter is designed for anyway.


- basically, I see two options to extend the lib to make up for  
your problem:
  + extend the xmlrpc_encode_entitites function to take into  
account the xmlrpc_internalencoding global var, and use 2 different  
parsing algorithms (better solution but slower)


Well...UTF-8 should only require converting "&", "<", and '"'  
explicitly, and the rest is assumed to be valid. So the only fork  
you'd need in the code is to convert additional entities for non- 
UTF-8 encodings. Shouldn't slow anything down...in fact, it would  
make UTF-8 faster, since it would skip additional processing.
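
A sketch of that UTF-8 fast path, under the assumption that the transport
really declares charset=utf-8 (otherwise the raw bytes are not legal). The
function name is hypothetical, and ">" is escaped as well, which goes slightly
beyond the three characters listed above.

```php
<?php
// Hypothetical: escape only the markup characters and pass every other byte
// through untouched. Only correct if the XML is declared as UTF-8.
function sketch_escape_utf8_xml($data)
{
    return str_replace(
        array('&', '<', '>', '"'),
        array('&amp;', '&lt;', '&gt;', '&quot;'),
        $data
    );
}
```

No per-character loop is needed, so Japanese text goes over the wire as
readable UTF-8 instead of a long run of numeric entities.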


In fact, I may be mistaken, but it seems like older versions of the  
library didn't even do the entity translation...at least, in the  
course of my own development, I know I included some entity  
conversion routines to process the data _before_ I sent it to the  
XMLRPC library (but it may have been redundant on my part). Though I  
admit I do like the idea that I can pass _anything_ to the XMLRPC  
library and have it properly encoded for me!



Would you be willing to test the patches?


Absolutely...but I do think you should give some serious thought to  
making the internal encoding variable more scriptable so no one ever  
needs to hard-code changes in the script file. I hate having to  
remember to change the variable value whenever I upgrade the library...


Cheers,
spud.


---
a.h.s. boy
spud(at)nothingness.org    "as yes is to if,love is to yes"
http://www.nothingness.org/
---



RE: [phpxmlrpc] xmlrpc_encode_entitites causing parse error

2005-11-15 Thread Gaetano Giunta
Brief analysis:

- the lib tries to encode all chars outside of the ASCII range as 'XML 
character entity' when serializing

- this has the main benefit that such an xml is valid regardless of the charset 
assumed by the parser, i.e. we do not need to add a 'charset' parameter to 
either the HTTP Content-type header or the XML prologue

- it is also the best solution I could come up with to solve the long-standing 
problems with charset encodings (I also tried the other way round, e.g. 
explicitly stating the charset used for xml, in a private fork of the lib I use 
for personal projects, but I would rather stick with the current approach, as 
it solves the problem in a more elegant way)

- unfortunately, as I work with non-mbstring enabled installs by default, I 
assumed that internal string representation was iso-8859-1, and coded the 
xmlrpc_encode_entitites function accordingly

- I am now looking at the PHP man page for utf8_decode, and there are a few
examples of correct utf8-to-xmlentities functions that might be of use

- basically, I see two options to extend the lib to make up for your problem:
  + extend the xmlrpc_encode_entitites function to take into account the
xmlrpc_internalencoding global var, and use 2 different parsing algorithms
(better solution but slower)
  + add a 'workaround' solution: a class var of server/client objects that will
prevent the escaping of non-ascii chars from taking place.
  + note that both things could actually be combined...
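
The effect of the iso-8859-1 assumption on UTF-8 input can be reproduced with
a small sketch. It models only the default case quoted in the original report
(bytes below 32 or above 159 become numeric entities); the function name is
made up and the rest of the real function is not reproduced.

```php
<?php
// Hypothetical, simplified model of byte-wise escaping that treats every
// byte as one ISO-8859-1 character.
function sketch_bytewise_escape($data)
{
    $out = '';
    for ($i = 0; $i < strlen($data); $i++) {
        $code = ord($data[$i]);
        if ($code < 32 || $code > 159) {
            $out .= '&#' . $code . ';';
        } else {
            $out .= $data[$i]; // bytes 32..159 are copied as they are
        }
    }
    return $out;
}

// U+65E5 is the three bytes E6 97 A5 in UTF-8.
$escaped = sketch_bytewise_escape("\xE6\x97\xA5");
// $escaped is now "&#230;" . "\x97" . "&#165;"
```

The first and last bytes turn into two unrelated Latin-1 entities, while the
middle byte 0x97 (151) falls inside the untouched 32..159 range and is left
raw. A lone 0x97 is neither ASCII nor valid UTF-8, which is a plausible source
of the parser's "Invalid token" error.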

Would you be willing to test the patches?

Bye
Gaetano

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of a.h.s. boy
> (lists)
> Sent: Tuesday, November 15, 2005 12:17 AM
> To: phpxmlrpc@lists.usefulinc.com
> Subject: [phpxmlrpc] xmlrpc_encode_entitites causing parse error
> 
> 
> I'm using the XML-RPC library to retrieve calendar listing records
> from a calendar website. Both the client and the server are using the
> latest XML-RPC library.
> 
> Both client and server are using UTF-8 encoding all around, and I've  
> adjusted $xmlrpc_internalencoding.
> 
> Some of the calendar entries are in Japanese, input with UTF-8
> encoding, and displayed on the site with UTF-8 encoding. (See
> http://www.radicalendar.org/calendar/index.php?view=month&group=imcjapan).
> 
> If I make an XMLRPC request to retrieve some Japanese entries, the
> library chokes and returns an "Invalid token" error. After what seems
> like 90 hours of debugging (checking the strings and arrays at
> various stages of encoding and parsing), I tracked the problem down
> to the default case of xmlrpc_encode_entitites()
> 
> default:
>     if ($code < 32 || $code > 159)
>         $character = ("&#".strval($code).";");
> 
> If I simply comment out that code, leaving a blank default case, the  
> XML is now valid and parses (and displays) exactly as expected. I  
> have NOT debugged the code to the extent where I can tell exactly  
> what character's entity reference might be the exact cause of the  
> problem...it's all complicated by the fact that I don't read  
> Japanese, so debugging is that much harder.
> 
> Any idea why the entity conversion is causing the XML to become  
> invalid? Is it feasible to leave off the
> 
> There's an example page at http://dev.dadaimc.org/mod/calendar/index.php
> with debugging turned on, but it'll only be valid for today
> (11/14/05 -0500), after which time the Japanese entry will no longer
> be part of the results. But I'd be happy to reproduce the problem
> upon request.
> 
> Cheers,
> spud.
> 
> 
> 
> ---
> a.h.s. boy
> spud(at)nothingness.org"as yes is to if,love is to yes"
> http://www.nothingness.org/
> ---
> 