[issue465502] urllib2: urlopen unicode problem

2022-04-10 Thread admin


Change by admin :


--
github: None -> 35241

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue418173] Unicode problem in Tkinter under Windows

2022-04-10 Thread admin


Change by admin :


--
github: None -> 34398




Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Marko Rauhamaa
BartC :
> Usually anything that is defined can be changed at run-time so that the
> compiler can never assume anything.

The compiler can't assume anything permanent, but it could heuristically
make excellent guesses at runtime. It needs to verify its guesses at the
boundaries of compiled code and gradually keep expanding the boundaries.
If the guesses end up being wrong, it has to correct its assumptions and
recompile the relevant parts of the code.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
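
The guess, verify, and fall-back cycle described in the message above can be illustrated in plain Python. This is only a toy sketch under stated assumptions, not how any real JIT compiler is implemented; `SpeculativeAdd`, `fast_int_add` and `generic_add` are hypothetical names invented for the illustration.

```python
# Toy illustration of speculation with a guard. All names are hypothetical.

def generic_add(a, b):
    # Slow path: makes no assumption about the operand types.
    return a + b

def fast_int_add(a, b):
    # Stand-in for "compiled" code: only valid while both operands are ints.
    return int.__add__(a, b)

class SpeculativeAdd:
    def __init__(self):
        self.assume_ints = True   # the current guess
        self.deopts = 0           # how many times the guess proved wrong

    def __call__(self, a, b):
        if self.assume_ints:
            # Guard: verify the guess at the boundary of the fast path.
            if type(a) is int and type(b) is int:
                return fast_int_add(a, b)
            # The guess was wrong: drop the assumption and fall back.
            self.assume_ints = False
            self.deopts += 1
        return generic_add(a, b)

add = SpeculativeAdd()
print(add(2, 3))       # takes the fast path
print(add("a", "b"))   # guard fails, generic path is used from now on
print(add.deopts)
```

A real implementation would recompile rather than flip a flag, but the shape is the same: the assumption is never trusted without a check at the boundary.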


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread BartC

On 04/07/2016 15:46, Ned Batchelder wrote:
> On Monday, July 4, 2016 at 10:36:54 AM UTC-4, BartC wrote:
>> On 04/07/2016 13:47, Ned Batchelder wrote:
>>> This is a huge change.
>>
>> I've used a kind of 'weak' import scheme elsewhere, corresponding to C's
>> '#include'.
>>
>> I think that could work in Python provided whatever is defined can
>> tolerate having copies redefined in each module that includes the same
>> file. Anything that is defined once and is never assigned to nor
>> modified for example.
>
> You are hand-waving over huge details of semantics that are very important
> in Python.  For example, it is very important not to have copies of
> classes.  Importing a module must produce the same module object
> everywhere it is imported, and the classes defined in the module must
> be defined only once.

So that would be something that doesn't tolerate copies.

But I think that a bigger change for Python wouldn't be new ways of
doing imports, but the concept of having a user-defined anything that is
a constant at compile-time. And not part of a conditional statement either.

Usually anything that is defined can be changed at run-time so that the
compiler can never assume anything.


--
Bartc

--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Ned Batchelder
On Monday, July 4, 2016 at 10:36:54 AM UTC-4, BartC wrote:
> On 04/07/2016 13:47, Ned Batchelder wrote:
> > On Monday, July 4, 2016 at 6:05:20 AM UTC-4, BartC wrote:
> >> On 04/07/2016 03:30, Steven D'Aprano wrote:
> 
> >>> You're still having problems with the whole Python-as-a-dynamic-language
> >>> thing, aren't you? :-)
> 
> >> Most Pythons seem to pre-compile code before executing the result. That
> >> pre-compilation requires that operators and precedences are known in
> >> advance and the resulting instructions are then hard-coded before 
> >> execution.
> >
> > This is the key but subtle point that all the discussion of parser mechanics
> > is missing: Python today needs no information from imported modules in
> > order to compile a file.  When the compiler encounters "import xyzzy" in
> > a file, it doesn't have to do anything to find or read xyzzy.py at compile
> > time.
> 
> Yeah, there's that small detail. Anything affecting how source is to be 
> parsed needs to be known in advance.
> 
> > If operators can be invented, they will only be useful if they can be
> > created in modules which you then import and use.  But that would mean that
> > imported files would have to be found and read during compilation, not
> > during execution as they are now.
> >
> > This is a huge change.
> 
> I've used a kind of 'weak' import scheme elsewhere, corresponding to C's 
> '#include'.
> 
> Then the textual contents of that 'imported' module are read by the 
> compiler, and treated as though they occurred in this module. No new 
> namespace is created.
> 
> I think that could work in Python provided whatever is defined can 
> tolerate having copies redefined in each module that includes the same 
> file. Anything that is defined once and is never assigned to nor 
> modified for example.

You are hand-waving over huge details of semantics that are very important
in Python.  For example, it is very important not to have copies of
classes.  Importing a module must produce the same module object
everywhere it is imported, and the classes defined in the module must
be defined only once.

This is what makes catching exceptions work (because it is based on an
exception being an instance of a particular class), and what makes
class attributes shared among all the instances of the class.

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list
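
The point about duplicated classes can be checked directly. The sketch below simulates a textual '#include' by running the same source text in two separate namespaces with `exec`; the names `SOURCE`, `ParseError`, `module_a` and `module_b` are made up for the illustration.

```python
# Simulate a textual include: run the same source in two namespaces.
SOURCE = """
class ParseError(Exception):
    pass

def fail():
    raise ParseError("boom")
"""

module_a = {}
module_b = {}
exec(SOURCE, module_a)   # first "including" module
exec(SOURCE, module_b)   # second "including" module

# Same name and same source text, but two distinct class objects.
same_class = module_a["ParseError"] is module_b["ParseError"]

caught = False
escaped = False
try:
    try:
        module_a["fail"]()            # raises module_a's ParseError
    except module_b["ParseError"]:    # does not match: a different class
        caught = True
except Exception:
    escaped = True

print(same_class, caught, escaped)
```

The inner `except` clause never fires, because exception matching is based on class identity rather than on the class name.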


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread BartC

On 04/07/2016 13:47, Ned Batchelder wrote:
> On Monday, July 4, 2016 at 6:05:20 AM UTC-4, BartC wrote:
>> On 04/07/2016 03:30, Steven D'Aprano wrote:
>>> You're still having problems with the whole Python-as-a-dynamic-language
>>> thing, aren't you? :-)
>>
>> Most Pythons seem to pre-compile code before executing the result. That
>> pre-compilation requires that operators and precedences are known in
>> advance and the resulting instructions are then hard-coded before execution.
>
> This is the key but subtle point that all the discussion of parser mechanics
> is missing: Python today needs no information from imported modules in
> order to compile a file.  When the compiler encounters "import xyzzy" in
> a file, it doesn't have to do anything to find or read xyzzy.py at compile
> time.

Yeah, there's that small detail. Anything affecting how source is to be
parsed needs to be known in advance.

> If operators can be invented, they will only be useful if they can be
> created in modules which you then import and use.  But that would mean that
> imported files would have to be found and read during compilation, not
> during execution as they are now.
>
> This is a huge change.

I've used a kind of 'weak' import scheme elsewhere, corresponding to C's
'#include'.

Then the textual contents of that 'imported' module are read by the
compiler, and treated as though they occurred in this module. No new
namespace is created.

I think that could work in Python provided whatever is defined can
tolerate having copies redefined in each module that includes the same
file. Anything that is defined once and is never assigned to nor
modified for example.


--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Ned Batchelder
On Monday, July 4, 2016 at 6:05:20 AM UTC-4, BartC wrote:
> On 04/07/2016 03:30, Steven D'Aprano wrote:
> > On Mon, 4 Jul 2016 10:17 am, BartC wrote:
> >
> >> On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
> >>> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
> >>>> Python lacks a mechanism to add user-defined operators. (R has this
> >>>> capability.) Maybe this feature could be added.
> >>>
> >>> That would be neat. But remember, you would have to define the operator
> >>> precedence as well. So you could no longer use a recursive-descent
> >>> parser.
> >>
> >> That wouldn't be a problem provided the new operator symbol and its
> >> precedence is known at compile time, and defined before use.
> >
> > You're still having problems with the whole Python-as-a-dynamic-language
> > thing, aren't you? :-)
> 
> Well it isn't completely dynamic, not unless code only exists as an eval 
> or exec argument string (and even there, any changes will only be seen 
> on calling eval or exec again on the same string).
> 
> Most Pythons seem to pre-compile code before executing the result. That 
> pre-compilation requires that operators and precedences are known in 
> advance and the resulting instructions are then hard-coded before execution.

This is the key but subtle point that all the discussion of parser mechanics
is missing: Python today needs no information from imported modules in
order to compile a file.  When the compiler encounters "import xyzzy" in
a file, it doesn't have to do anything to find or read xyzzy.py at compile
time.

If operators can be invented, they will only be useful if they can be
created in modules which you then import and use.  But that would mean that
imported files would have to be found and read during compilation, not
during execution as they are now.

This is a huge change.

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list
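
The claim that compilation needs nothing from imported modules is easy to demonstrate with the built-in `compile` and `exec`. In this small sketch the module name `xyzzy_module_that_does_not_exist` is deliberately made up so that the import must fail.

```python
source = "import xyzzy_module_that_does_not_exist\n"

# Compilation succeeds: the compiler never looks for the module.
code = compile(source, "<demo>", "exec")

# The import is only attempted when the code object is executed.
try:
    exec(code, {})
    import_failed = False
except ImportError:
    import_failed = True

print(import_failed)
```

The failure appears only at execution time, which is why operators defined in an imported module could not influence how the importing file is parsed without changing this model.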


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Rustom Mody
On Monday, July 4, 2016 at 3:56:43 PM UTC+5:30, BartC wrote:
> On 04/07/2016 02:15, Lawrence D’Oliveiro wrote:
> > On Monday, July 4, 2016 at 12:40:14 PM UTC+12, BartC wrote:
> >> The structure of such a parser doesn't need to exactly match the grammar
> >> with a dedicated block of code for each operator precedence. It can be
> >> table-driven so that an operator precedence value is just an attribute.
> >
> > Of course. But that’s not a recursive-descent parser any more.
> >
> 
> All the parsers I write work the same way. If I can't describe them as 
> recursive descent, then I don't know what they are.
> 
> This is just recognising that a bunch of specialised functions that are 
> very similar can be reduced to one or two more generalised ones.

In gofer (likewise Haskell) one can concoct any operator and give it a
precedence and associativity -- l,r,non

Internals of Haskell I do not know, but of gofer I can say the following:

Implementation is in C.
Uses yacc to parse all operators left-assoc, same precedence
Then post-processes the tree with an elegant little shift-reduce parser
based on specified precedences and associativities.

I sometimes teach this to my kids as an example of how 
FP-style comments can clarify arcane imperative code:

Mark Jones (gofer author) original version + My version made executable
http://blog.languager.org/2016/07/a-little-functional-parser.html
-- 
https://mail.python.org/mailman/listinfo/python-list
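
The two-stage scheme described above (parse everything flat, then rebuild the tree from declared precedences and associativities) can be sketched in Python. This is not gofer's actual code; the `FIXITY` table, the `tidy` function and the tuple-based tree are assumptions made for the illustration.

```python
# Hypothetical fixity table: symbol -> (precedence, associativity),
# where associativity is "l", "r" or "n" (non-associative).
FIXITY = {"+": (6, "l"), "-": (6, "l"), "*": (7, "l"), "^": (8, "r")}

def tidy(flat):
    """Rebuild a flat [operand, op, operand, op, ...] list into a nested
    (op, left, right) tree with an operator-precedence shift-reduce pass."""
    operands = [flat[0]]
    operators = []

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for i in range(1, len(flat), 2):
        op, operand = flat[i], flat[i + 1]
        prec, assoc = FIXITY[op]
        while operators:
            top_prec, top_assoc = FIXITY[operators[-1]]
            same = top_prec == prec
            if top_prec > prec or (same and assoc == "l" and top_assoc == "l"):
                reduce_top()                      # reduce
            elif same and (assoc != top_assoc or assoc == "n"):
                raise SyntaxError(f"ambiguous use of {operators[-1]!r} and {op!r}")
            else:
                break
        operators.append(op)                      # shift
        operands.append(operand)

    while operators:
        reduce_top()
    return operands[0]

print(tidy(["a", "+", "b", "*", "c"]))
print(tidy(["a", "-", "b", "-", "c"]))
print(tidy(["a", "^", "b", "^", "c"]))
```

Because the grammar proper never sees precedence, adding an operator only means adding a row to the table.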


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Jussi Piitulainen
BartC writes:

> A simpler approach is to treat user-defined operators as aliases for
> functions:
>
> def myadd(a,b):
>   return a+b
>
> operator ∇:
>    (myadd,2,+3)   # map to myadd, 2 operands, prio 3, LTR
>
> x = y ∇ z
>
> is then equivalent to:
>
> x = myadd(y,z)
>
> However you will usually want to be able to overload the same operator
> for different operand types. That means mapping the operator to one of
> several methods. Maybe even allowing the operator to have either one
> or two operands.
>
> Trickier but still doable I think.

Julia does something like that. The parser knows a number of symbols
that it treats as operators, some of them are aliases for ASCII names,
all operators correspond to generic functions, and the programmer can
add methods for their own types (or for pre-existing types) to these
functions.

Prolog opens its precedence table for the programmer. I don't know if
there's been any Unicode activity, or any activity, in recent years, but
there are actually two different issues here: what is parsed as an
identifier, and what identifiers are treated as operator symbols (with
what precedence and associativity).
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread BartC

On 04/07/2016 02:15, Lawrence D’Oliveiro wrote:
> On Monday, July 4, 2016 at 12:40:14 PM UTC+12, BartC wrote:
>> The structure of such a parser doesn't need to exactly match the grammar
>> with a dedicated block of code for each operator precedence. It can be
>> table-driven so that an operator precedence value is just an attribute.
>
> Of course. But that’s not a recursive-descent parser any more.

All the parsers I write work the same way. If I can't describe them as
recursive descent, then I don't know what they are.

This is just recognising that a bunch of specialised functions that are
very similar can be reduced to one or two more generalised ones.


--
bartc
--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Jussi Piitulainen
Lawrence D’Oliveiro writes:

> On Monday, July 4, 2016 at 6:08:51 PM UTC+12, Jussi Piitulainen wrote:
>> Something could be done, but if the intention is to allow
>> mathematical notation, it needs to be done with care.
>
> Mathematics uses single-character variable names so that
> multiplication can be implicit.

Certainly on topic, though independent of Unicode. I was thinking of
different classes of operator symbols.

> An old, stillborn language design from the 1960s called CPL* had two
> syntaxes for variable names:
> * a single lowercase letter, optionally followed by any number of primes “'”;
> * an uppercase letter followed by letters or digits.
>
> It also allowed implicit multiplication; single-letter identifiers
> could be run together without spaces, but multi-character ones needed
> to be delimited by spaces or non-identifier characters. E.g.
>
>   Sqrt(bb - 4ac)
>   Area ≡ Length Width
>
> *It was never fully implemented, but a cut-down derivative named BCPL
> did get some use. Some researchers at Bell Labs took it as their
> starting point, first creating a language called “B”, then another one
> called “C” ... well, the rest is history. 

There's been at least D, F, J, K (APL family), R, S (_before_ R), T (a
Lisp), X (the window system), Z (some specification language).

Any single-letter non-ASCII names yet? Spelled-out like Lambda and Omega
don't count.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread BartC

On 04/07/2016 03:30, Steven D'Aprano wrote:
> On Mon, 4 Jul 2016 10:17 am, BartC wrote:
>> On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
>>> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
>>>> Python lacks a mechanism to add user-defined operators. (R has this
>>>> capability.) Maybe this feature could be added.
>>>
>>> That would be neat. But remember, you would have to define the operator
>>> precedence as well. So you could no longer use a recursive-descent
>>> parser.
>>
>> That wouldn't be a problem provided the new operator symbol and its
>> precedence is known at compile time, and defined before use.
>
> You're still having problems with the whole Python-as-a-dynamic-language
> thing, aren't you? :-)

Well it isn't completely dynamic, not unless code only exists as an eval
or exec argument string (and even there, any changes will only be seen
on calling eval or exec again on the same string).

Most Pythons seem to pre-compile code before executing the result. That
pre-compilation requires that operators and precedences are known in
advance and the resulting instructions are then hard-coded before execution.

> In full generality, you would want to be able to define unary prefix, unary
> suffix and binary infix operators, and set their precedence and whether
> they associate to the left or the right. That's probably a bit much to
> expect.

No, that's all possible. Maybe that's even how some language
implementations work, defining all the set of standard operators at the
start.

> But if we limit ourselves to the boring case of binary infix operators of a
> single precedence and associativity, there's a simple approach: the parser
> can allow any Unicode code point of category "Sm" as a legal operator, e.g.
> x ∇ y. Pre-defined operators like + - * etc continue to call the same
> dunder methods they already do, but anything else tries calling:
>
> x.__oper__('∇', y)
> y.__roper__('∇', x)
>
> and if neither of those exist and return a result other than NotImplemented,
> then finally raise a runtime TypeError('undefined operator ∇').

A simpler approach is to treat user-defined operators as aliases for
functions:

def myadd(a,b):
    return a+b

operator ∇:
   (myadd,2,+3)   # map to myadd, 2 operands, prio 3, LTR

x = y ∇ z

is then equivalent to:

x = myadd(y,z)

However you will usually want to be able to overload the same operator for
different operand types. That means mapping the operator to one of
several methods. Maybe even allowing the operator to have either one or
two operands.

Trickier but still doable I think.

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Marko Rauhamaa
Lawrence D’Oliveiro :

> Mathematics uses single-character variable names so that
> multiplication can be implicit.

I don't think anybody developed mathematical notation systematically.
Rather, over the centuries, various masters came up with personal
abbreviations and shorthand, which spread among admirers and students
through emulation. The resulting two-dimensional hodgepodge needs to be
supplemented by much natural-language handwaving. Rigorous treatment
needs to use a formal language, e.g. <http://us.metamath.org/mpeuni/evlslem2.html>.

Anyway, most programming has little use for mathematics. Thus, a
general-purpose programming language shouldn't bend over backwards to
placate that particular application domain.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Lawrence D’Oliveiro
On Monday, July 4, 2016 at 6:08:51 PM UTC+12, Jussi Piitulainen wrote:
> Something could be done, but if the intention is to allow
> mathematical notation, it needs to be done with care.

Mathematics uses single-character variable names so that multiplication can be 
implicit.

An old, stillborn language design from the 1960s called CPL* had two syntaxes 
for variable names:
* a single lowercase letter, optionally followed by any number of primes “'”;
* an uppercase letter followed by letters or digits.

It also allowed implicit multiplication; single-letter identifiers could be run 
together without spaces, but multi-character ones needed to be delimited by 
spaces or non-identifier characters. E.g.

  Sqrt(bb - 4ac)
  Area ≡ Length Width

*It was never fully implemented, but a cut-down derivative named BCPL did get 
some use. Some researchers at Bell Labs took it as their starting point, first 
creating a language called “B”, then another one called “C” ... well, the rest 
is history.
-- 
https://mail.python.org/mailman/listinfo/python-list
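
The two naming rules quoted above are enough to make implicit multiplication unambiguous at the token level. The sketch below is not a CPL implementation; the regular expression and the `expand` helper are assumptions that only follow the rules as stated in the message (single lowercase letter plus optional primes, or an uppercase letter followed by letters or digits).

```python
import re

# Multi-character name | single lowercase letter with primes | number | other
TOKEN = re.compile(r"[A-Z][A-Za-z0-9]*|[a-z]'*|\d+|\S")

def is_operand(token):
    # Names and numbers start with a letter or digit.
    return token[0].isalnum()

def expand(text):
    """Make implicit multiplication explicit by inserting '*' between
    two adjacent operands (names or numbers)."""
    out = []
    for token in TOKEN.findall(text):
        if out and is_operand(out[-1]) and is_operand(token):
            out.append("*")
        out.append(token)
    return " ".join(out)

print(expand("Sqrt(bb - 4ac)"))
print(expand("Area ≡ Length Width"))
```

Note that `Sqrt(` is left alone because a name followed by a parenthesis is not two adjacent operands.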


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-04 Thread Jussi Piitulainen
Rustom Mody writes:

> Subscripts OTOH as part of identifier-lexemes don't seem to have any
> issues

They have the general issue that one might *want* them interpreted as
indexes, so that a₁ would mean the same as a[1].

Mathematical symbols face similar issues. One would not *want* them all
be binary operators; a specific level of precedence would not be good
for all uses; and some uses of some symbols need chaining and then
parentheses do not help. Just for the starters.

> My main point being Unicode gives a wide repertory -- that's good
> It also gives char-classification -- that's a start
> But it's not enough for designing a (modern) programming language

So I agree. Something could be done, but if the intention is to allow
mathematical notation, it needs to be done with care.

(And no, I'm not saying Python needs to do anything at this time, and I
do not express any opinion on how likely Python is to do anything about
Unicode math at this time or ever, and so on. Just that I would not be
happy to have all those symbols available in a way that is not usable
for the intended purpose so please do take care.)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Rustom Mody
On Monday, July 4, 2016 at 8:03:47 AM UTC+5:30, Steven D'Aprano wrote:
> On Mon, 4 Jul 2016 07:28 am, Lawrence D’Oliveiro wrote:
> 
> > On Monday, July 4, 2016 at 6:39:45 AM UTC+12, John Ladasky wrote:
> >> Here's another worm for the can.  Would you rather read this...
> >> 
> >> d = sqrt(x**2 + y**2)
> >> 
> >> ...or this?
> >> 
> >> d = √(x² + y²)
> > 
> > Neither. I would rather see
> > 
> > d = math.hypot(x, y)
> > 
> > Much simpler, don’t you think?
> 
> Only if you think of x and y as the sides of a triangle, and remember
> that "hypot" is a Unix-like abbreviation for hypotenuse (rather than,
> say, "hypothesis"). And it doesn't help you one bit when it comes to:
> 
> a = √(4x²y - 3xy² + 2xy - 1)

In math typically one would write

a = √4x²y - 3xy² + 2xy - 1

with the radical sign running along up to and slightly beyond the 1

My Unicode prowess is not up to doing that
Though experts may be able to use macrons/overlines 

> 
> 
> Personally, I'm not convinced about using the very limited number of
> superscript code points to represent exponentiation. Using √ as a unary
> operator looks cute, but I don't know that it adds enough to the language
> to justify the addition.

I guess I am more or less in agreement (on THIS/THESE)
ie √ and superscripts is probably not worth the headache

Subscripts OTOH as part of identifier-lexemes don't seem to have any issues

Python3

>>> a₁ = 1
  File "<stdin>", line 1
    a₁ = 1
     ^
SyntaxError: invalid character in identifier

Haskell already has it

Prelude>  let a₁ = 1
Prelude>  a₁
1
Prelude> 

Haskell allows the same for superscripts:

Prelude> let a¹ = 1
Prelude> a¹
1

which is probably not such a great idea!
Prelude>  a¹ +   a₁
2
Prelude> 

My main point being Unicode gives a wide repertory -- that's good
It also gives char-classification -- that's a start
But it's not enough for designing a (modern) programming language

Of course one can stay with ASCII
Like "There are many ways to skin a cat"
the modern version would be "There are many ways to be a Luddite"
-- 
https://mail.python.org/mailman/listinfo/python-list
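
The character classification mentioned above can be inspected from Python itself with the standard `unicodedata` module and `str.isidentifier`. This is just an exploratory sketch; the sample characters are chosen to match the thread.

```python
import unicodedata

# Subscript one, superscript one, nabla, and an accented letter.
for ch in ("₁", "¹", "∇", "é"):
    candidate = "a" + ch
    # Show the general category and whether Python accepts the name.
    print(repr(candidate), unicodedata.category(ch), candidate.isidentifier())
```

The subscript digit is classified as "No" (other number) and the nabla as "Sm" (math symbol); neither is accepted in a Python identifier, whereas a letter such as "é" is.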


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Steven D'Aprano
On Mon, 4 Jul 2016 07:28 am, Lawrence D’Oliveiro wrote:

> On Monday, July 4, 2016 at 6:39:45 AM UTC+12, John Ladasky wrote:
>> Here's another worm for the can.  Would you rather read this...
>> 
>> d = sqrt(x**2 + y**2)
>> 
>> ...or this?
>> 
>> d = √(x² + y²)
> 
> Neither. I would rather see
> 
> d = math.hypot(x, y)
> 
> Much simpler, don’t you think?

Only if you think of x and y as the sides of a triangle, and remember
that "hypot" is a Unix-like abbreviation for hypotenuse (rather than,
say, "hypothesis"). And it doesn't help you one bit when it comes to:

a = √(4x²y - 3xy² + 2xy - 1)


Personally, I'm not convinced about using the very limited number of
superscript code points to represent exponentiation. Using √ as a unary
operator looks cute, but I don't know that it adds enough to the language
to justify the addition.



-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Steven D'Aprano
On Mon, 4 Jul 2016 10:17 am, BartC wrote:

> On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
>> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
>>> Python lacks a mechanism to add user-defined operators. (R has this
>>> capability.) Maybe this feature could be added.
>>
>> That would be neat. But remember, you would have to define the operator
>> precedence as well. So you could no longer use a recursive-descent
>> parser.
> 
> That wouldn't be a problem provided the new operator symbol and its
> precedence is known at compile time, and defined before use.

You're still having problems with the whole Python-as-a-dynamic-language
thing, aren't you? :-)

In full generality, you would want to be able to define unary prefix, unary
suffix and binary infix operators, and set their precedence and whether
they associate to the left or the right. That's probably a bit much to
expect.

But if we limit ourselves to the boring case of binary infix operators of a
single precedence and associativity, there's a simple approach: the parser
can allow any Unicode code point of category "Sm" as a legal operator, e.g.
x ∇ y. Pre-defined operators like + - * etc continue to call the same
dunder methods they already do, but anything else tries calling:

x.__oper__('∇', y)
y.__roper__('∇', x)

and if neither of those exist and return a result other than NotImplemented,
then finally raise a runtime TypeError('undefined operator ∇').

But I don't think this will ever be part of Python.



-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
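
The lookup order proposed above can be emulated today with an ordinary function, without touching the parser. In this sketch `apply_operator` and the `Vec` class are hypothetical; `__oper__` and `__roper__` are the method names from the message and are not part of Python.

```python
import unicodedata

def apply_operator(symbol, x, y):
    """Emulate the proposed dispatch for `x <symbol> y` (hypothetical helper)."""
    if unicodedata.category(symbol) != "Sm":
        raise SyntaxError(f"{symbol!r} is not in Unicode category Sm")
    # Try the left operand first, then the reflected method on the right.
    forward = getattr(type(x), "__oper__", None)
    if forward is not None:
        result = forward(x, symbol, y)
        if result is not NotImplemented:
            return result
    reflected = getattr(type(y), "__roper__", None)
    if reflected is not None:
        result = reflected(y, symbol, x)
        if result is not NotImplemented:
            return result
    raise TypeError(f"undefined operator {symbol}")

class Vec:
    def __init__(self, *items):
        self.items = items

    def __oper__(self, symbol, other):
        # Only one symbol is supported in this toy class.
        if symbol == "∇" and isinstance(other, Vec):
            return sum(a * b for a, b in zip(self.items, other.items))
        return NotImplemented

print(apply_operator("∇", Vec(1, 2), Vec(3, 4)))
```

As with the existing binary dunder methods, returning `NotImplemented` lets the other operand have a try before the `TypeError` is raised.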


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Random832
On Sun, Jul 3, 2016, at 21:15, Lawrence D’Oliveiro wrote:
> On Monday, July 4, 2016 at 12:40:14 PM UTC+12, BartC wrote:
> > The structure of such a parser doesn't need to exactly match the grammar 
> > with a dedicated block of code for each operator precedence. It can be 
> > table-driven so that an operator precedence value is just an attribute.
> 
> Of course. But that’s not a recursive-descent parser any more.

It's still recursive descent if it, for example, calls the _same_ block
of code recursively with arguments to tell it which operator is being
considered. This would be analogous to, in Python, implementing a
recursive-descent parser with arbitrary callable objects instead of
simple functions.
-- 
https://mail.python.org/mailman/listinfo/python-list
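
A single routine that calls itself with a precedence argument, driven by an operator table, might look like the following. It is a minimal sketch of the idea (often called precedence climbing), not anyone's actual parser; the `OPERATORS` table, `tokenize` and `parse` are assumptions, and the `∇` row borrows the alias-for-a-function idea from elsewhere in the thread.

```python
import re

# Hypothetical operator table: symbol -> (precedence, right_assoc, function).
# Adding an operator means adding a row, not writing a new parsing function.
OPERATORS = {
    "+": (1, False, lambda a, b: a + b),
    "-": (1, False, lambda a, b: a - b),
    "*": (2, False, lambda a, b: a * b),
    "∇": (3, False, lambda a, b: a + b),   # user-defined alias for addition
}

def tokenize(text):
    # Integers, or any single non-space, non-digit character.
    return re.findall(r"\d+|[^\s\d]", text)

def parse(tokens, min_prec=0):
    """Evaluate the token list (consumed from the front) recursively."""
    token = tokens.pop(0)
    if token == "(":
        left = parse(tokens, 0)
        tokens.pop(0)                      # discard the closing ")"
    else:
        left = int(token)
    while tokens and tokens[0] in OPERATORS:
        prec, right_assoc, func = OPERATORS[tokens[0]]
        if prec < min_prec:
            break
        tokens.pop(0)
        # Same block of code, called again with the operator's precedence.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = func(left, right)
    return left

print(parse(tokenize("1 + 2 * 3")))
print(parse(tokenize("(1 + 2) * 3")))
print(parse(tokenize("2 * 3 ∇ 4")))
```

The recursion is still descent through the expression structure; the precedence level has simply become data instead of a separate function per level.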


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Random832
On Sun, Jul 3, 2016, at 20:00, Lawrence D’Oliveiro wrote:
> That would be neat. But remember, you would have to define the operator
> precedence as well. So you could no longer use a recursive-descent
> parser.

You could use a recursive-descent parser if you monkey-patch the parser
when adding a new operator.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Monday, July 4, 2016 at 12:40:14 PM UTC+12, BartC wrote:
> The structure of such a parser doesn't need to exactly match the grammar 
> with a dedicated block of code for each operator precedence. It can be 
> table-driven so that an operator precedence value is just an attribute.

Of course. But that’s not a recursive-descent parser any more.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread BartC

On 04/07/2016 01:24, Lawrence D’Oliveiro wrote:
> On Monday, July 4, 2016 at 12:17:47 PM UTC+12, BartC wrote:
>> On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
>>> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
>>>> Python lacks a mechanism to add user-defined operators. (R has this
>>>> capability.) Maybe this feature could be added.
>>>
>>> That would be neat. But remember, you would have to define the operator
>>> precedence as well. So you could no longer use a recursive-descent parser.
>>
>> That wouldn't be a problem provided the new operator symbol and its
>> precedence is known at compile time, and defined before use.
>
> That is how it is normally done. (E.g. Algol 68.)
>
> But you still couldn’t use a recursive-descent parser.

Why not?

The structure of such a parser doesn't need to exactly match the grammar
with a dedicated block of code for each operator precedence. It can be
table-driven so that an operator precedence value is just an attribute.


--
Bartc



--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Monday, July 4, 2016 at 12:17:47 PM UTC+12, BartC wrote:
>
> On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
>>
>> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
>>>
>>> Python lacks a mechanism to add user-defined operators. (R has this
>>> capability.) Maybe this feature could be added.
>>
>> That would be neat. But remember, you would have to define the operator
>> precedence as well. So you could no longer use a recursive-descent parser.
> 
> That wouldn't be a problem provided the new operator symbol and its 
> precedence is known at compile time, and defined before use.

That is how it is normally done. (E.g. Algol 68.)

But you still couldn’t use a recursive-descent parser.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread BartC

On 04/07/2016 01:00, Lawrence D’Oliveiro wrote:
> On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
>> Python lacks a mechanism to add user-defined operators. (R has this
>> capability.) Maybe this feature could be added.
>
> That would be neat. But remember, you would have to define the operator
> precedence as well. So you could no longer use a recursive-descent parser.

That wouldn't be a problem provided the new operator symbol and its
precedence is known at compile time, and defined before use.



--
Bartc

--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Monday, July 4, 2016 at 11:47:26 AM UTC+12, eryk sun wrote:
> Python lacks a mechanism to add user-defined operators. (R has this
> capability.) Maybe this feature could be added.

That would be neat. But remember, you would have to define the operator 
precedence as well. So you could no longer use a recursive-descent parser.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread eryk sun
On Sun, Jul 3, 2016 at 6:58 AM, John Ladasky  wrote:
> The nabla symbol (∇) is used in the naming of gradients. Python isn't having it.
> The interpreter throws a "SyntaxError: invalid character in identifier" when it
> encounters the ∇.

Del is a mathematical operator to take the gradient. It's not part of
the name. For `∇f`, the operator is `∇` and the function name is `f`.
Python lacks a mechanism to add user-defined operators. (R has this
capability.) Maybe this feature could be added. To make parsing
simple, user-defined operators could be limited to non-ASCII symbol
characters (math and other -- Sm, So). That simple option is off the
table if we allow symbol characters in names.

Adding an operator to the language itself requires a PEP. Recently PEP
465 added an `@` operator for matrix products. For example:

>>> x = np.array([1j, 1])
>>> x @ x
0j
>>> x @ x.conj() # Hermitian inner product
(2+0j)

Note that using a non-ASCII operator was ruled out:

http://legacy.python.org/dev/peps/pep-0465/#choice-of-operator
-- 
https://mail.python.org/mailman/listinfo/python-list
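
The `@` operator added by PEP 465 is available to any class through the `__matmul__` method, so the NumPy session above can be mirrored without NumPy. The `Vector` class here is a hypothetical stand-in written for this illustration, and it requires Python 3.5 or later.

```python
class Vector:
    """Tiny stand-in for a 1-D array; only supports the @ operator."""

    def __init__(self, *components):
        self.components = components

    def __matmul__(self, other):
        # x @ y is dispatched to type(x).__matmul__(x, y).
        if not isinstance(other, Vector):
            return NotImplemented
        return sum(a * b for a, b in zip(self.components, other.components))

    def conj(self):
        # Complex conjugate of every component.
        return Vector(*(complex(c).conjugate() for c in self.components))

x = Vector(1j, 1)
print(x @ x)
print(x @ x.conj())   # Hermitian inner product
```

This shows the limit of what Python offers today: a fixed set of operator symbols whose behaviour a class can define, rather than new symbols.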


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Monday, July 4, 2016 at 6:39:45 AM UTC+12, John Ladasky wrote:
> Here's another worm for the can.  Would you rather read this...
> 
> d = sqrt(x**2 + y**2)
> 
> ...or this?
> 
> d = √(x² + y²)

Neither. I would rather see

d = math.hypot(x, y)

Much simpler, don’t you think?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Random832 :

> Being able to put any character in a symbol doesn't make those strings
> identifiers, any more than passing them to getattr/setattr (or
> __import__, something's __name__, etc) does in Python.

From R7RS, the newest Scheme standard (p. 61-62):

 7.1.1. Lexical structure
 [...]
 〈vertical line〉 → |
 [...]
 〈identifier〉 → 〈initial〉 〈subsequent〉*
  | 〈vertical line〉 〈symbol element〉* 〈vertical line〉
  | 〈peculiar identifier〉
 〈initial〉 → 〈letter〉 | 〈special initial〉
 〈letter〉 → a | b | c | ... | z
 | A | B | C | ... | Z
 〈special initial〉 → ! | $ | % | & | * | / | : | < | =
 | > | ? | ^ | _ | ~
 〈subsequent〉 → 〈initial〉 | 〈digit〉
 | 〈special subsequent〉
 〈digit〉 → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 〈hex digit〉 → 〈digit〉 | a | b | c | d | e | f
 〈explicit sign〉 → + | -
 〈special subsequent〉 → 〈explicit sign〉 | . | @
 〈inline hex escape〉 → \x〈hex scalar value〉;
 〈hex scalar value〉 → 〈hex digit〉 +
 〈mnemonic escape〉 → \a | \b | \t | \n | \r
 〈peculiar identifier〉 → 〈explicit sign〉
 | 〈explicit sign〉 〈sign subsequent〉 〈subsequent〉*
 | 〈explicit sign〉 . 〈dot subsequent〉 〈subsequent〉*
 | . 〈dot subsequent〉 〈subsequent〉*
 〈dot subsequent〉 → 〈sign subsequent〉 | .
 〈sign subsequent〉 → 〈initial〉 | 〈explicit sign〉 | @
 〈symbol element〉 →
 〈any character other than 〈vertical line〉or \〉
 | 〈inline hex escape〉 | 〈mnemonic escape〉 | \|


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Sunday, July 3, 2016 at 11:50:52 PM UTC+12, BartC wrote:
> Otherwise you can be looking at:
> 
>a b c d e f g h
> 
> (not Scheme) and wondering which are names and which are operators.

I did a language design for my MSc thesis where all “functions” were operators. 
So a construct like “f(a, b, c)” was really a monadic operator “f” followed by 
a single argument, a record constructor “(a, b, c)”.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Sunday, July 3, 2016 at 9:02:05 PM UTC+12, Marko Rauhamaa wrote:
> Lawrence D’Oliveiro:
> 
>> On Sunday, July 3, 2016 at 7:27:04 PM UTC+12, Marko Rauhamaa wrote:
>>
>>> Personally, I don't think even π should be used in identifiers.
>>
>> Why not?
> 
> 1. It can't be typed easily.

I have a custom .XCompose, so it’s just “compose-p-i”. Easy to type, easy to 
remember.

> 2. It can look like an n.

Only to someone accustomed to using just one alphabet. :)

> 3. Single-character identifiers should not be promoted, especially with
>a global scope.

It’s no more “global” than “math.e”. And what about “1j”? (That completes the 
triumvirate of single-letter names from the Euler identity.)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Random832
On Sun, Jul 3, 2016, at 07:22, Marko Rauhamaa wrote:
> Christian Gollwitzer :
> > Am 03.07.16 um 13:01 schrieb Marko Rauhamaa:
> >> Scheme allows *any* characters whatsoever in identifiers.
> >
> > Parentheses?
> 
> Yes.
> 
> Hint: Python allows *any* characters whatsoever in strings.

Being able to put any character in a symbol doesn't make those strings
identifiers, any more than passing them to getattr/setattr (or
__import__, something's __name__, etc) does in Python.
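
To make the Python half of that comparison concrete, here is a minimal sketch. `SimpleNamespace` is just a convenient stand-in object; any ordinary instance should behave the same way.

```python
from types import SimpleNamespace

ns = SimpleNamespace()

# An arbitrary string works as an attribute *name* at run time...
setattr(ns, "∇f", 42)
print(getattr(ns, "∇f"))

# ...but that does not make it an identifier the parser would accept.
print("∇f".isidentifier())
print("f".isidentifier())
```

So `ns.∇f` is still a SyntaxError in source code, even though the attribute exists.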
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread MRAB

On 2016-07-03 19:39, John Ladasky wrote:
> On Sunday, July 3, 2016 at 12:42:14 AM UTC-7, Chris Angelico wrote:
>> On Sun, Jul 3, 2016 at 4:58 PM, John Ladasky wrote:
>>
>> Very good question! The detailed answer is here:
>>
>> https://docs.python.org/3/reference/lexical_analysis.html#identifiers
>>
>>> A philosophical question.  Why should any character be excluded from a
>>> variable name, besides the fact that it might also be an operator?
>>
>> In a way, that's exactly what's happening here. Python permits certain
>> categories of character as identifiers, leaving other categories
>> available for operators. Even though there aren't any non-ASCII
>> operators in a vanilla CPython, it's plausible that someone could
>> create a Python-based language with more operators (eg ≠ NOT EQUAL TO
>> as an alternative to !=), and I'm sure you'd agree that saying "≠ = 1"
>> is nonsensical.
>
> I agree that there are some characters in the Unicode definition that could
> (should?) be operators and, as such, disallowed in identifiers.  "≠", "≥" and
> "√" come to mind.  I don't know whether the Unicode "character properties" are
> assigned to the characters in a way that would be satisfying to the needs of
> programmers.  I'll do some reading.
>
>> Symbols like that are a bit of a
>> grey area, so you may find that you're starting a huge debate :)
>
> Oh, I can see that debate coming.  I know that not all of these characters are
> easily TYPED, and so I have to reach for a Unicode table to cut and paste them.
> But once cut and pasted, they are easily READ, and that's a big plus.
>
> Here's another worm for the can.  Would you rather read this...
>
> d = sqrt(x**2 + y**2)
>
> ...or this?
>
> d = √(x² + y²)

It's easy to read something as simple as that, but it's harder when
the exponent is more than a number or a variable. And what about a**b**c?


Not to mention the limited number of superscript codepoints available...
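
For reference, a tiny sketch of how Python itself reads a**b**c: the power operator groups from the right, which is the part that superscript notation would have to express through nesting. The values are arbitrary.

```python
a, b, c = 2, 3, 2

right_grouped = a ** (b ** c)   # exponent is b**c
left_grouped = (a ** b) ** c    # a different value

print(a ** b ** c, right_grouped, left_grouped)
```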
--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread John Ladasky
On Sunday, July 3, 2016 at 12:42:14 AM UTC-7, Chris Angelico wrote:
> On Sun, Jul 3, 2016 at 4:58 PM, John Ladasky wrote:

> Very good question! The detailed answer is here:
> 
> https://docs.python.org/3/reference/lexical_analysis.html#identifiers
> 
> > A philosophical question.  Why should any character be excluded from a 
> > variable name, besides the fact that it might also be an operator?
> 
> In a way, that's exactly what's happening here. Python permits certain
> categories of character as identifiers, leaving other categories
> available for operators. Even though there aren't any non-ASCII
> operators in a vanilla CPython, it's plausible that someone could
> create a Python-based language with more operators (eg ≠ NOT EQUAL TO
> as an alternative to !=), and I'm sure you'd agree that saying "≠ = 1"
> is nonsensical.

I agree that there are some characters in the Unicode definition that could 
(should?) be operators and, as such, disallowed in identifiers.  "≠", "≥" and 
"√" come to mind.  I don't know whether the Unicode "character properties" are 
assigned to the characters in a way that would be satisfying to the needs of 
programmers.  I'll do some reading.

> Symbols like that are a bit of a
> grey area, so you may find that you're starting a huge debate :)

Oh, I can see that debate coming.  I know that not all of these characters are 
easily TYPED, and so I have to reach for a Unicode table to cut and paste them. 
 But once cut and pasted, they are easily READ, and that's a big plus.

Here's another worm for the can.  Would you rather read this...

d = sqrt(x**2 + y**2)

...or this?

d = √(x² + y²)

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread John Ladasky
Lawrence, I trust you understand that I didn't post a complete working program, 
just a few lines showing the intended usage?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Chris Angelico
On Sun, Jul 3, 2016 at 7:01 PM, Marko Rauhamaa  wrote:
> Lawrence D’Oliveiro :
>
>> On Sunday, July 3, 2016 at 7:27:04 PM UTC+12, Marko Rauhamaa wrote:
>>
>>> Personally, I don't think even π should be used in identifiers.
>>
>> Why not?
>
> 1. It can't be typed easily.
>
> 2. It can look like an n.
>
> 3. Single-character identifiers should not be promoted, especially with
>a global scope.

None of these is a language-level concern. You can't type it? That's
your problem - and you can choose not to use it. But Python lets you,
if you want to. Remember, some people speak Greek natively, and for
those people, typing Greek text is as natural as typing Latin text is
for us. Similarly, Cyrillic text is the most natural language for
Russian speakers. Why should Python block them?

Your other concerns might be a case for linters, but definitely not
the language.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Christian Gollwitzer :

> Am 03.07.16 um 13:22 schrieb Marko Rauhamaa:
>> Christian Gollwitzer :
>>> Am 03.07.16 um 13:01 schrieb Marko Rauhamaa:
>>>> Scheme allows *any* characters whatsoever in identifiers.
>>> Parentheses?
>> Yes.
>
> My knowledge of Scheme is rusty. How do you do that?

   Moreover, all characters whose Unicode scalar values are greater than
   127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
   Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within
   identifiers. In addition, any character can be used within an
   identifier when specified via an <inline hex escape>. For example,
   the identifier H\x65;llo is the same as the identifier Hello, and the
   identifier \x3BB; is the same as the identifier λ.
   <http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4>

Guile doesn't support the R6RS inline hex escape notation. Instead, it
natively supports a notation of its own:

   #{foo bar}#

   #{what
   ever}#

   #{4242}#

Or the R7RS notation:

   |foo bar|
   |\x3BB; is a greek lambda|
   |\| is a vertical bar|

   <https://www.gnu.org/software/guile/manual/html_node/Symbol-Read-Syntax.html#index-r7rs_002dsymbols>


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Christian Gollwitzer

Am 03.07.16 um 13:22 schrieb Marko Rauhamaa:
> Christian Gollwitzer :
>> Am 03.07.16 um 13:01 schrieb Marko Rauhamaa:
>>> Alain Ketterlin :
>>>
>>>> It would be very confusing to have a variable named ∇f, as confusing
>>>> as naming a variable a+b or √x.
>>>
>>> Scheme allows *any* characters whatsoever in identifiers.
>>
>> Parentheses?
>
> Yes.
>
> Hint: Python allows *any* characters whatsoever in strings.


My knowledge of Scheme is rusty. How do you do that? Consider

(define x 'hello)

then the x is the identifier, isn't it? How can you include a 
metacharacter like space, ', or ( in it? I'm using 
https://repl.it/languages/scheme to try it out.


Another language which allows any characters in identifiers is Tcl. Here 
you can quote identifiers:


set {a b} c

creates a variable "a b" with a space in it, because there is no 
distinction between quoted/unquoted. Metacharacters can be included by 
\-escapes. How does that work in Scheme?


Christian

--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread BartC

On 03/07/2016 12:01, Marko Rauhamaa wrote:
> Alain Ketterlin :
>
>> It would be very confusing to have a variable named ∇f, as confusing
>> as naming a variable a+b or √x.
>
> Scheme allows *any* characters whatsoever in identifiers.


I think it's one of those languages that has already dispensed with most 
syntax anyway. Including distinctions between names and symbols.


Some people think that extra syntax rules including enforcing such 
distinctions and having restrictions can improve readability. Otherwise 
you can be looking at:


  a b c d e f g h

(not Scheme) and wondering which are names and which are operators.

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Christian Gollwitzer :

> Am 03.07.16 um 13:01 schrieb Marko Rauhamaa:
>> Alain Ketterlin :
>>
>>> It would be very confusing to have a variable named ∇f, as confusing
>>> as naming a variable a+b or √x.
>>
>> Scheme allows *any* characters whatsoever in identifiers.
>
> Parentheses?

Yes.

Hint: Python allows *any* characters whatsoever in strings.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Christian Gollwitzer

Am 03.07.16 um 13:01 schrieb Marko Rauhamaa:
> Alain Ketterlin :
>
>> It would be very confusing to have a variable named ∇f, as confusing
>> as naming a variable a+b or √x.
>
> Scheme allows *any* characters whatsoever in identifiers.

Parentheses?

Christian

--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Alain Ketterlin :

> It would be very confusing to have a variable named ∇f, as confusing
> as naming a variable a+b or √x.

Scheme allows *any* characters whatsoever in identifiers.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Alain Ketterlin
John Ladasky  writes:

> from math import pi as π
> [...]
> c = 2 * π * r

> Up until today, every character I've tried has been accepted by the
> Python interpreter as a legitimate character for inclusion in a
> variable name. Now I'm copying a formula which defines a gradient. The
> nabla symbol (∇) is used in the naming of gradients. Python isn't
> having it. The interpreter throws a "SyntaxError: invalid character in
> identifier" when it encounters the ∇.

The rules are at
https://docs.python.org/3.5/reference/lexical_analysis.html#identifiers

To me it makes a lot of sense to *not* include category Sm characters in
identifiers, since they are usually used to denote operators (like +).
It would be very confusing to have a variable named ∇f, as confusing as
naming a variable a+b or √x.
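
A quick way to see those categories side by side is sketched below. `describe` is a made-up helper name; `unicodedata.category` and `str.isidentifier` are standard-library calls.

```python
import unicodedata


def describe(chars):
    """Return (character, Unicode category, usable alone as identifier) triples."""
    return [(ch, unicodedata.category(ch), ch.isidentifier()) for ch in chars]


for ch, category, ok in describe("πλ∇√+"):
    print(ch, category, ok)
```

The letters come out as category Ll and pass; ∇, √ and + are all category Sm and fail, which matches the rule linked above.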

-- Alain.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Robert Kern

On 2016-07-03 08:29, Jussi Piitulainen wrote:
> (Hm. Python seems to understand that the character occurs in what is
> intended to be an identifier. Perhaps that's a default error message.)


I suspect that "identifier" is the final catch-all token in the lexer. Comments 
and strings are clearly delimited. Keywords, operators, and [{(braces)}] are all 
explicitly whitelisted from finite lists. Well, I guess it could have been 
intended by the user to be a numerical literal, but I suspect that's attempted 
before identifier.


--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco

--
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Lawrence D’Oliveiro :

> On Sunday, July 3, 2016 at 7:27:04 PM UTC+12, Marko Rauhamaa wrote:
>
>> Personally, I don't think even π should be used in identifiers.
>
> Why not?

1. It can't be typed easily.

2. It can look like an n.

3. Single-character identifiers should not be promoted, especially with
   a global scope.

> Python already has all the other single-character constants in what
> probably the most fundamental identity in all of mathematics:
>
> $$e^{i \pi} + 1 = 0$$

Mathematics and physics have run into trouble with single-character
identifiers already. They have run out of letters and have had to reuse
them. Programmers used to have the same problem until they realized it's
ok to use descriptive names.

Just say,

>>> import cmath
>>> cmath.e ** (1j * cmath.pi) + 1
1.2246467991473532e-16j


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Sunday, July 3, 2016 at 7:27:04 PM UTC+12, Marko Rauhamaa wrote:

> Personally, I don't think even π should be used in identifiers.

Why not? Python already has all the other single-character constants in what 
probably the most fundamental identity in all of mathematics:

$$e^{i \pi} + 1 = 0$$
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Chris Angelico
On Sun, Jul 3, 2016 at 4:58 PM, John Ladasky  wrote:
> Up until today, every character I've tried has been accepted by the Python 
> interpreter as a legitimate character for inclusion in a variable name.  Now 
> I'm copying a formula which defines a gradient.  The nabla symbol (∇) is used 
> in the naming of gradients.  Python isn't having it.  The interpreter throws 
> a "SyntaxError: invalid character in identifier" when it encounters the ∇.
>
> I am now wondering what constitutes a valid character for an identifier, and 
> how they were chosen.  Obviously, the Western alphabet and standard Greek 
> letters work.  I just tried a few very weird characters from the Latin 
> Extended range, and some Cyrillic characters.  These are also fine.
>

Very good question! The detailed answer is here:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

> A philosophical question.  Why should any character be excluded from a 
> variable name, besides the fact that it might also be an operator?
>

In a way, that's exactly what's happening here. Python permits certain
categories of character as identifiers, leaving other categories
available for operators. Even though there aren't any non-ASCII
operators in a vanilla CPython, it's plausible that someone could
create a Python-based language with more operators (eg ≠ NOT EQUAL TO
as an alternative to !=), and I'm sure you'd agree that saying "≠ = 1"
is nonsensical.

> This might be a problem I can solve, I'm not sure.  Is there a file that the 
> Python interpreter refers to which defines the accepted variable name 
> characters?  Perhaps I could just add ∇.
>

The key here is its Unicode category:

>>> unicodedata.category("∇")
'Sm'

You could probably hack CPython to include Sm, and maybe Sc, Sk, and
So, as valid identifier characters. I'm not sure where, though, and
I've just spent a good bit of time delving (it's based on the
XID_Start and XID_Continue derived properties, but I have no idea
where they're defined - Tools/unicode/makeunicodedata.py looks
promising, but even there, I can't find it). And - or maybe instead -
you could appeal to the core devs to have the category/ies in question
added to the official Python spec. Symbols like that are a bit of a
grey area, so you may find that you're starting a huge debate :)
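
Short of patching the interpreter, a candidate name can at least be checked from Python before it goes into source code. The following is only a sketch: `invalid_identifier_chars` is a hypothetical helper, and probing each character separately ignores the NFKC normalization that the real parser applies to identifiers.

```python
import unicodedata


def invalid_identifier_chars(name):
    """List (character, category) pairs that keep `name` from being an identifier."""
    bad = []
    for index, ch in enumerate(name):
        # The first character must be able to start an identifier; later
        # characters only need to be able to continue one, so they are
        # tested behind a leading underscore.
        probe = ch if index == 0 else "_" + ch
        if not probe.isidentifier():
            bad.append((ch, unicodedata.category(ch)))
    return bad


print(invalid_identifier_chars("∇f"))
print(invalid_identifier_chars("π_r2"))
print(invalid_identifier_chars("2x"))
```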

Have fun.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Rustom Mody
On Sunday, July 3, 2016 at 12:29:14 PM UTC+5:30, John Ladasky wrote:
> A while back, I shared my love for using Greek letters as variable names in 
> my Python (3.4) code -- when, and only when, they are warranted for improved 
> readability.  For example, I like to see the following:
> 
> 
> from math import pi as π
> 
> c = 2 * π * r
> 
> 
> When I am copying mathematical formulas from publications, and Greek letters 
> are used in that publication, I prefer to follow the text exactly as written.
> 
> Up until today, every character I've tried has been accepted by the Python 
> interpreter as a legitimate character for inclusion in a variable name.  Now 
> I'm copying a formula which defines a gradient.  The nabla symbol (∇) is used 
> in the naming of gradients.  Python isn't having it.  The interpreter throws 
> a "SyntaxError: invalid character in identifier" when it encounters the ∇.
> 
> I am now wondering what constitutes a valid character for an identifier, and 
> how they were chosen.  Obviously, the Western alphabet and standard Greek 
> letters work.  I just tried a few very weird characters from the Latin 
> Extended range, and some Cyrillic characters.  These are also fine.

https://docs.python.org/3.5/reference/lexical_analysis.html
points to
https://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html

Quite hardwired

> 
> A philosophical question.  Why should any character be excluded from a 
> variable name, besides the fact that it might also be an operator?
> 
> This might be a problem I can solve, I'm not sure.  Is there a file that the 
> Python interpreter refers to which defines the accepted variable name 
> characters?  Perhaps I could just add ∇.

You need to try something like

>>> import unicodedata as ud
>>> ud.category("∇")
'Sm'
>>> ud.category("A")
'Lu'
>>> ud.category("π")
'Ll'
>>> ud.category("a")
'Ll'

followed by figuring out why/what etc from (say)
https://en.wikipedia.org/wiki/Unicode_character_property

This is the way it IS
Not saying it SHOULD BE…
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Marko Rauhamaa
Lawrence D’Oliveiro :

> It wasn’t the “π” it was complaining about...

The question is why π is accepted but ∇ is not.

The immediate reason is that π is a letter while ∇ is not. But the
question, then, is why bother excluding nonletters from identifiers.

Personally, I don't think even π should be used in identifiers.
Mathematicians and physicists have a questionable tradition of using
single-character identifiers in their formulas. That shouldn't be
transported to programming.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Jussi Piitulainen
John Ladasky writes:

[- -]

> The nabla symbol (∇) is used in the naming of gradients.  Python isn't
> having it.  The interpreter throws a "SyntaxError: invalid character
> in identifier" when it encounters the ∇.
>
> I am now wondering what constitutes a valid character for an
> identifier, and how they were chosen.  Obviously, the Western alphabet
> and standard Greek letters work.  I just tried a few very weird
> characters from the Latin Extended range, and some Cyrillic
> characters.  These are also fine.

I think they merely extended the identifier syntax to Unicode: one or
more letters, underscores and digits, not starting with a digit. The
nabla symbol is not classified as a letter in Unicode, so it's not
allowed under this rule, and there is no other rule to allow it.

(Hm. Python seems to understand that the character occurs in what is
intended to be an identifier. Perhaps that's a default error message.)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread Lawrence D’Oliveiro
On Sunday, July 3, 2016 at 6:59:14 PM UTC+12, John Ladasky wrote:
> from math import pi as π
> 
> c = 2 * π * r

ldo@theon:~> python3
Python 3.5.1+ (default, Jun 10 2016, 09:03:40) 
[GCC 5.4.0 20160603] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from math import pi as π
>>> 
>>> c = 2 * π * r
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'r' is not defined

It wasn’t the “π” it was complaining about...
-- 
https://mail.python.org/mailman/listinfo/python-list


Well, I finally ran into a Python Unicode problem, sort of

2016-07-03 Thread John Ladasky
A while back, I shared my love for using Greek letters as variable names in my 
Python (3.4) code -- when, and only when, they are warranted for improved 
readability.  For example, I like to see the following:


from math import pi as π

c = 2 * π * r


When I am copying mathematical formulas from publications, and Greek letters 
are used in that publication, I prefer to follow the text exactly as written.

Up until today, every character I've tried has been accepted by the Python 
interpreter as a legitimate character for inclusion in a variable name.  Now 
I'm copying a formula which defines a gradient.  The nabla symbol (∇) is used 
in the naming of gradients.  Python isn't having it.  The interpreter throws a 
"SyntaxError: invalid character in identifier" when it encounters the ∇.

I am now wondering what constitutes a valid character for an identifier, and 
how they were chosen.  Obviously, the Western alphabet and standard Greek 
letters work.  I just tried a few very weird characters from the Latin Extended 
range, and some Cyrillic characters.  These are also fine.

A philosophical question.  Why should any character be excluded from a variable 
name, besides the fact that it might also be an operator?

This might be a problem I can solve, I'm not sure.  Is there a file that the 
Python interpreter refers to which defines the accepted variable name 
characters?  Perhaps I could just add ∇.
-- 
https://mail.python.org/mailman/listinfo/python-list


How to work around a unicode problem?

2012-01-24 Thread tinnews
I have a small python program that uses the pyexiv2 package to view
exif data in image files.

I've hit a problem because I have a filename with accented characters
in its path and the pyexiv2 code traps as follows:-

Traceback (most recent call last):
  File "/home/chris/bin/eview.py", line 87, in <module>
    image = pyexiv2.ImageMetadata(filepath)
  File "/usr/lib/python2.7/dist-packages/pyexiv2/metadata.py", line 65, in __init__
    self.filename = filename.encode(sys.getfilesystemencoding())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)

Without digging deep into pyexiv2 is there any way I can work around
this error?  The accented characters aren't in the filename itself,
they're in the directory path.  I.e. it's:-

./1977/04 April/#09 - Monaco, inc. Musée de Poupée/p77_08_011.jpg

I could of course remove the accents but I'd much prefer not to do so.

-- 
Chris Green
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to work around a unicode problem?

2012-01-24 Thread Chris Rebert
On Tue, Jan 24, 2012 at 3:57 AM,  tinn...@isbd.co.uk wrote:
> I have a small python program that uses the pyexiv2 package to view
> exif data in image files.
>
> I've hit a problem because I have a filename with accented characters
> in its path and the pyexiv2 code traps as follows:-
>
>     Traceback (most recent call last):
>       File "/home/chris/bin/eview.py", line 87, in <module>
>         image = pyexiv2.ImageMetadata(filepath)
>       File "/usr/lib/python2.7/dist-packages/pyexiv2/metadata.py", line 65, in __init__
>         self.filename = filename.encode(sys.getfilesystemencoding())
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
>
> Without digging deep into pyexiv2 is there any way I can work around
> this error?  The accented characters aren't in the filename itself,
> they're in the directory path.

After glancing at the docs, (untested):

with open(filepath) as f:
    image = pyexiv2.ImageMetadata.from_buffer(f.read())

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to work around a unicode problem?

2012-01-24 Thread Peter Otten
tinn...@isbd.co.uk wrote:

> I have a small python program that uses the pyexiv2 package to view
> exif data in image files.
>
> I've hit a problem because I have a filename with accented characters
> in its path and the pyexiv2 code traps as follows:-
>
> Traceback (most recent call last):
>   File "/home/chris/bin/eview.py", line 87, in <module>
>     image = pyexiv2.ImageMetadata(filepath)
>   File "/usr/lib/python2.7/dist-packages/pyexiv2/metadata.py", line 65, in __init__
>     self.filename = filename.encode(sys.getfilesystemencoding())
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
>
> Without digging deep into pyexiv2 is there any way I can work around
> this error?  The accented characters aren't in the filename itself,
> they're in the directory path.  I.e. it's:-
>
> ./1977/04 April/#09 - Monaco, inc. Musée de Poupée/p77_08_011.jpg
>
> I could of course remove the accents but I'd much prefer not to do so.
 
Try passing a unicode filename. A quickfix:

filepath = filepath.decode(sys.getfilesystemencoding())
image = pyexiv2.ImageMetadata(filepath)

If you are using os.listdir() or glob.glob() to produce the filepath -- they 
will return unicode filenames if you invoke them with a unicode argument.
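
The quickfix above is for Python 2. For readers on Python 3, a rough counterpart is sketched below using `os.fsdecode` and `os.fsencode`; the directory name is taken from the original post, and whether a given library wants text or bytes paths is an assumption to check against its own documentation.

```python
import os

# Bytes as they might come from a byte-oriented API (UTF-8 encoded here).
raw_path = "Musée de Poupée".encode("utf-8")

# Convert to str using the filesystem encoding and its error handler.
text_path = os.fsdecode(raw_path)

# The conversion is designed to round-trip back to the same bytes.
assert os.fsencode(text_path) == raw_path
print(type(text_path).__name__)
```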

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to work around a unicode problem?

2012-01-24 Thread tinnews
Peter Otten __pete...@web.de wrote:
> tinn...@isbd.co.uk wrote:
>
>> I have a small python program that uses the pyexiv2 package to view
>> exif data in image files.
>>
>> I've hit a problem because I have a filename with accented characters
>> in its path and the pyexiv2 code traps as follows:-
>>
>> Traceback (most recent call last):
>>   File "/home/chris/bin/eview.py", line 87, in <module>
>>     image = pyexiv2.ImageMetadata(filepath)
>>   File "/usr/lib/python2.7/dist-packages/pyexiv2/metadata.py", line 65, in __init__
>>     self.filename = filename.encode(sys.getfilesystemencoding())
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
>>
>> Without digging deep into pyexiv2 is there any way I can work around
>> this error?  The accented characters aren't in the filename itself,
>> they're in the directory path.  I.e. it's:-
>>
>> ./1977/04 April/#09 - Monaco, inc. Musée de Poupée/p77_08_011.jpg
>>
>> I could of course remove the accents but I'd much prefer not to do so.
>
> Try passing a unicode filename. A quickfix:
>
> filepath = filepath.decode(sys.getfilesystemencoding())
> image = pyexiv2.ImageMetadata(filepath)

... and this solution works too, thank you.

> If you are using os.listdir() or glob.glob() to produce the filepath -- they
> will return unicode filenames if you invoke them with a unicode argument.
 
-- 
Chris Green
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to work around a unicode problem?

2012-01-24 Thread tinnews
Chris Rebert c...@rebertia.com wrote:
> On Tue, Jan 24, 2012 at 3:57 AM,  tinn...@isbd.co.uk wrote:
>> I have a small python program that uses the pyexiv2 package to view
>> exif data in image files.
>>
>> I've hit a problem because I have a filename with accented characters
>> in its path and the pyexiv2 code traps as follows:-
>>
>>     Traceback (most recent call last):
>>       File "/home/chris/bin/eview.py", line 87, in <module>
>>         image = pyexiv2.ImageMetadata(filepath)
>>       File "/usr/lib/python2.7/dist-packages/pyexiv2/metadata.py", line 65, in __init__
>>         self.filename = filename.encode(sys.getfilesystemencoding())
>>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
>>
>> Without digging deep into pyexiv2 is there any way I can work around
>> this error?  The accented characters aren't in the filename itself,
>> they're in the directory path.
>
> After glancing at the docs, (untested):
>
> with open(filepath) as f:
>     image = pyexiv2.ImageMetadata.from_buffer(f.read())
 
Excellent, worked perfectly (after I spotted I had another variable f).

Thank you.

-- 
Chris Green
-- 
http://mail.python.org/mailman/listinfo/python-list


unicode problem?

2010-10-09 Thread Brian Blais
This may be stemming from my complete ignorance of unicode, but when I do 
this (Python 2.6):

s='\xc2\xa9 2008 \r\n'

and I want the ascii version of it, ignoring any non-ascii chars, I thought I 
could do:

s.encode('ascii','ignore')

but it gives the error:

In [20]: s.encode('ascii','ignore')

UnicodeDecodeError                        Traceback (most recent call last)

/Users/bblais/python/doit100810a.py in <module>()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal 
not in range(128)

am I doing something stupid here?

of course, as a workaround, I can do: ''.join([c for c in s if ord(c) < 128])

but I thought the encode call should work.

thanks,
bb

-- 
Brian Blais
bbl...@bryant.edu
http://web.bryant.edu/~bblais
http://bblais.blogspot.com/



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode problem?

2010-10-09 Thread Benjamin Kaplan
On Sat, Oct 9, 2010 at 7:59 PM, Brian Blais bbl...@bryant.edu wrote:
 This may be stemming from my complete ignorance of unicode, but when I do 
 this (Python 2.6):

 s='\xc2\xa9 2008 \r\n'

 and I want the ascii version of it, ignoring any non-ascii chars, I thought I 
 could do:

 s.encode('ascii','ignore')

 but it gives the error:

 In [20]: s.encode('ascii','ignore')

 UnicodeDecodeError                        Traceback (most recent call last)

 /Users/bblais/python/doit100810a.py in <module>()

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: 
 ordinal not in range(128)

 am I doing something stupid here?

 of course, as a workaround, I can do: ''.join([c for c in s if ord(c) < 128])

 but I thought the encode call should work.

                thanks,
                        bb


Encode takes a Unicode string (made up of code points) and turns it
into a byte string (a sequence of bytes). In your case, you don't have
a Unicode string. You have a byte string. In order to encode that
sequence of bytes into a different encoding, you have to first figure
out what those bytes mean (decode it). Python has no way of knowing
that your strings are UTF-8 so it just tries ascii as the default.

You can either decode the byte string explicitly or (if it's actually
a literal in your code) just specify it as a Unicode string.
s = u'\u00a9 2008'
s.encode('ascii','ignore')

The encode vs. decode confusion was removed in Python 3: byte strings
don't have an encode method and unicode strings don't have a decode
method.
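
The decode-then-encode sequence described above can be sketched as
follows. This is a minimal illustration written for Python 3 (an
assumption; the thread itself is about Python 2.6), where byte strings
carry a b prefix. The variable names are made up for the example.

```python
# The bytes from the original post: UTF-8 for the copyright sign, then text.
raw = b'\xc2\xa9 2008 \r\n'

# Step 1: decode, stating explicitly that the bytes are UTF-8.
text = raw.decode('utf-8')

# Step 2: encode to ASCII, dropping any code point ASCII cannot represent.
ascii_only = text.encode('ascii', 'ignore')

# In Python 3 the two directions live on different types:
# bytes objects have no encode method, str objects have no decode method.
bytes_can_encode = hasattr(raw, 'encode')
str_can_decode = hasattr(text, 'decode')

print(repr(ascii_only))
```

Only the copyright sign is dropped; the spaces and the line ending
survive because they are plain ASCII.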

 --
 Brian Blais
 bbl...@bryant.edu
 http://web.bryant.edu/~bblais
 http://bblais.blogspot.com/



 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode problem?

2010-10-09 Thread Chris Rebert
On Sat, Oct 9, 2010 at 4:59 PM, Brian Blais bbl...@bryant.edu wrote:
 This may be stemming from my complete ignorance of unicode, but when I do 
 this (Python 2.6):

 s='\xc2\xa9 2008 \r\n'

 and I want the ascii version of it, ignoring any non-ascii chars, I thought I 
 could do:

 s.encode('ascii','ignore')

 but it gives the error:

 In [20]: s.encode('ascii','ignore')

 UnicodeDecodeError                        Traceback (most recent call last)

 /Users/bblais/python/doit100810a.py in <module>()

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: 
 ordinal not in range(128)

 am I doing something stupid here?

In addition to Benjamin's explanation:

Unicode strings in Python are of type `unicode` and written with a
leading "u"; e.g. u"A unicode string for ¥500". Byte strings lack the
leading "u"; e.g. "A plain byte string". Note that "Unicode string"
does not refer to strings which have been encoded using a Unicode
encoding (e.g. UTF-8); such strings are still byte strings, for
encodings emit bytes.

As to why you got the /exact/ error you did:
As a backward compatibility hack, in order to satisfy your nonsensical
encoding request, Python implicitly tried to decode the byte string
`s` using ASCII as a default (the choice of ASCII here has nothing to
do with the fact that you specified ASCII in your encoding request),
so that it could then try and encode the resulting unicode string;
hence why you got a Unicode*De*codeError as opposed to a
Unicode*En*codeError, despite the fact you called *en*code().
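
The two-step behaviour described above can be imitated explicitly. The
sketch below is written for Python 3 (an assumption; Python 3 no longer
performs the implicit step), and py2_style_encode is a hypothetical
helper that only mimics what Python 2 did behind the scenes.

```python
raw = b'\xc2\xa9 2008 \r\n'


def py2_style_encode(byte_string, encoding, errors='strict'):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    # Implicit step: decode with ASCII, regardless of the requested encoding
    # and regardless of the requested error handler.
    text = byte_string.decode('ascii')
    # Explicit step: the encode the caller actually asked for.
    return text.encode(encoding, errors)


try:
    py2_style_encode(raw, 'ascii', 'ignore')
    outcome = 'no error'
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode, so 'ignore' never applies.
    outcome = 'UnicodeDecodeError at position %d' % exc.start

print(outcome)
```

The 'ignore' handler is passed only to the encode step, which is never
reached, so it cannot suppress the error.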

Highly suggested further reading:
The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Re: unicode problem?

2010-10-09 Thread hidura
I had a similar problem: I can't encode the bytes of an uploaded file
without damaging the data. If I use utf-8 to encode the file, its size
doubles. I tried changing the codec to raw_unicode_escape, which gives
nearly the correct size but still damages the file. I am using Python 3
and have to encode the file again.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-25 Thread abhi
On Mar 24, 4:55 am, Martin v. Löwis mar...@v.loewis.de wrote:
  So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
  \0s after a char, printf or wprintf is only printing one letter.

 No. printf indeed will see a terminating character. However, wprintf
 should correctly know that a wchar_t has four bytes per character,
 and print it correctly. Make sure to use %ls to print wchar_t arrays;
 %s would print multi-byte character strings.

  I need to further process the data and those libraries will need the
  data in UCS2 format (2 bytes), otherwise they fail.

 Are you absolutely sure about that? Why does that library expect
 UCS-2, when you system's wchar_t is four bytes?

 In any case, do what MAL told you: use the UCS-2 codec to convert
 the Unicode string to a 2-bytes-per-char byte string. The PyObject
 you get from the conversion is a byte string object; use
 PyString_AsStringAndSize to get to the actual bytes.

 Regards,
 Martin

Thanks Marc and Martin, my preliminary trials are showing positive
results with this method.

-
Abhigyan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread abhi
On Mar 20, 5:47 pm, M.-A. Lemburg m...@egenix.com wrote:
 On 2009-03-20 12:13, abhi wrote:





  On Mar 20, 11:03 am, Martin v. Löwis mar...@v.loewis.de wrote:
  Any idea on why this is happening?
  Can you provide a complete example? Your code looks correct, and should
  just work.

  How do you know the result contains only 't' (i.e. how do you know it
  does not contain 'e', 's', 't')?

  Regards,
  Martin

  Hi Martin,
   Here is the code:
  unicodeTest.c

  #include <Python.h>

  static PyObject *unicode_helper(PyObject *self,PyObject *args){
      PyObject *sampleObj = NULL;
      Py_UNICODE *sample = NULL;

      if (!PyArg_ParseTuple(args, "O", &sampleObj)){
          return NULL;
      }

      // Explicitly convert it to unicode and get Py_UNICODE value
      sampleObj = PyUnicode_FromObject(sampleObj);
      sample = PyUnicode_AS_UNICODE(sampleObj);
      wprintf(L"database value after unicode conversion is : %s\n",
              sample);

 You have to use PyUnicode_AsWideChar() to convert a Python
 Unicode object to a wchar_t representation.

 Please don't make any assumptions on what Py_UNICODE maps
 to and always use the Unicode API for this. It is designed
 to provide a portable interface and will not do more conversion
 work than necessary.





      return Py_BuildValue("");
  }

  static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
  unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};

  void initunicodeTest(void){
      Py_InitModule3("unicodeTest",funcs,"");
  }

  When I install this unicodeTest on a UCS-2 Python, wprintf prints
  whatever is passed, e.g.

  import unicodeTest
  unicodeTest.unicodeTest("hello world")
  database value after unicode conversion is : hello world

  but it prints the following on a UCS-4 configured Python:
  database value after unicode conversion is : h

  Regards,
  Abhigyan
  --
 http://mail.python.org/mailman/listinfo/python-list

 --
 Marc-Andre Lemburg
 eGenix.com

 Professional Python Services directly from the Source  (#1, Mar 20 2009) 
 Python/Zope Consulting and Support ...        http://www.egenix.com/
  mxODBC.Zope.Database.Adapter ...            http://zope.egenix.com/
  mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

 

 ::: Try our new mxODBC.Connect Python Database Interface for free ! 

    eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
     D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
            Registered at Amtsgericht Duesseldorf: HRB 46611
                 http://www.egenix.com/company/contact/

Hi Mark,
 Thanks for the help. I tried PyUnicode_AsWideChar() but I am
getting the same result i.e. only the first letter.

sample code:

#include <Python.h>

static PyObject *unicode_helper(PyObject *self,PyObject *args){
    PyObject *sampleObj = NULL;
    wchar_t *sample = NULL;
    int size = 0;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // use wide char function
    size = PyUnicode_AsWideChar(databaseObj, sample,
                                PyUnicode_GetSize(databaseObj));
    printf("%d chars are copied to sample\n", size);
    wprintf(L"database value after unicode conversion is : %s\n",
            sample);
    return Py_BuildValue("");

}


static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");

}

This prints the following when input value is given as test:
4 chars are copied to sample
database value after unicode conversion is : t

Any ideas?

-
Abhigyan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread John Machin
On Mar 23, 6:18 pm, abhi abhigyan_agra...@in.ibm.com wrote:

[snip]
 Hi Mark,
      Thanks for the help. I tried PyUnicode_AsWideChar() but I am
 getting the same result i.e. only the first letter.

 sample code:

 #include <Python.h>

 static PyObject *unicode_helper(PyObject *self,PyObject *args){
         PyObject *sampleObj = NULL;
         wchar_t *sample = NULL;
         int size = 0;

         if (!PyArg_ParseTuple(args, "O", &sampleObj)){
                 return NULL;
         }

         // use wide char function
         size = PyUnicode_AsWideChar(databaseObj, sample,
                                     PyUnicode_GetSize(databaseObj));

What is databaseObj???  Copy/paste the *actual* code that you compiled
and ran.

         printf("%d chars are copied to sample\n", size);
         wprintf(L"database value after unicode conversion is : %s\n",
                 sample);
         return Py_BuildValue("");

 }

 static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
 unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};

 void initunicodeTest(void){
         Py_InitModule3("unicodeTest",funcs,"");

 }

 This prints the following when input value is given as test:
 4 chars are copied to sample
 database value after unicode conversion is : t

[presuming littleendian] The ucs4 string will look like \t\0\0\0e
\0\0\0s\0\0\0t\0\0\0 in memory. I suspect that your wprintf is
grokking only 16-bit doodads -- \t\0 is printed and then \0\0 is
end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
and see what you get. Use bog-standard printf to print the hex
representation of each of the 16 bytes starting at the address sample
is pointing to.
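
The memory layout behind that suspicion can be inspected from Python.
This is only a sketch of the hypothesis (a reader that walks the buffer
in 16-bit units), written for Python 3; it does not show what any
particular wprintf implementation really does.

```python
import struct

text = 'test'

# UCS-4 / UTF-32 little-endian: four bytes per character.
ucs4 = text.encode('utf-32-le')

# Reinterpret the same bytes as 16-bit units, as a 2-byte reader would.
units = struct.unpack('<%dH' % (len(ucs4) // 2), ucs4)

# Collect characters until the first zero unit, which acts as a terminator.
seen = []
for unit in units:
    if unit == 0:
        break
    seen.append(chr(unit))
printed = ''.join(seen)

# Hex dump of the 16 bytes, comparable to the printf check suggested above.
hex_dump = ' '.join('%02x' % b for b in ucs4)
print(hex_dump)
print(repr(printed))
```

Under this assumption only the first letter is seen before a zero unit
ends the string, which matches the symptom reported in the thread.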

--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread John Machin
On Mar 23, 6:41 pm, John Machin sjmac...@lexicon.net had a severe
attack of backslashitis:

 [presuming littleendian] The ucs4 string will look like \t\0\0\0e
 \0\0\0s\0\0\0t\0\0\0 in memory. I suspect that your wprintf is
 grokking only 16-bit doodads -- \t\0 is printed and then \0\0 is
 end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
 and see what you get. Use bog-standard printf to print the hex
 representation of each of the 16 bytes starting at the address sample
 is pointing to.

and typed \t in two places where he should have typed t :-)
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread M.-A. Lemburg
On 2009-03-23 08:18, abhi wrote:
 On Mar 20, 5:47 pm, M.-A. Lemburg m...@egenix.com wrote:
 unicodeTest.c
 #include <Python.h>
 static PyObject *unicode_helper(PyObject *self,PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    wprintf(L"database value after unicode conversion is : %s\n",
            sample);
 You have to use PyUnicode_AsWideChar() to convert a Python
 Unicode object to a wchar_t representation.

 Please don't make any assumptions on what Py_UNICODE maps
 to and always use the Unicode API for this. It is designed
 to provide a portable interface and will not do more conversion
 work than necessary.

 Hi Mark,
  Thanks for the help. I tried PyUnicode_AsWideChar() but I am
 getting the same result i.e. only the first letter.
 
 sample code:
 
 #include <Python.h>
 
 static PyObject *unicode_helper(PyObject *self,PyObject *args){
    PyObject *sampleObj = NULL;
    wchar_t *sample = NULL;
    int size = 0;
 
    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
 
    // use wide char function
    size = PyUnicode_AsWideChar(databaseObj, sample,
                                PyUnicode_GetSize(databaseObj));

The 3. argument is the buffer size in bytes, not code points.
The result will require sizeof(wchar_t) * PyUnicode_GetSize(databaseObj)
bytes without a trailing NUL, otherwise sizeof(wchar_t) *
(PyUnicode_GetSize(databaseObj) + 1).

You also have to allocate the buffer to store the wchar_t data in.
Passing in a NULL pointer will result in a seg fault. The function
does not allocate a buffer for you:

/* Copies the Unicode Object contents into the wchar_t buffer w.  At
   most size wchar_t characters are copied.

   Note that the resulting wchar_t string may or may not be
   0-terminated.  It is the responsibility of the caller to make sure
   that the wchar_t string is 0-terminated in case this is required by
   the application.

   Returns the number of wchar_t characters copied (excluding a
   possibly trailing 0-termination character) or -1 in case of an
   error. */

PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
PyUnicodeObject *unicode,   /* Unicode object */
register wchar_t *w,/* wchar_t buffer */
Py_ssize_t size /* size of buffer */
);
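
The buffer arithmetic above can be checked from Python with ctypes.
This is a minimal sketch for Python 3, not the C API call itself; it
assumes that ctypes.c_wchar has the same size as the platform's wchar_t.
Note that the header comment counts the size in wchar_t characters, so
the element count is the string length plus one for the trailing NUL,
and the byte count is that number times sizeof(wchar_t).

```python
import ctypes

text = 'test'

# Size of one wchar_t on this platform (commonly 4 on Linux, 2 on Windows).
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Number of wchar_t elements needed, including the trailing NUL.
elements = len(text) + 1

# Allocate the buffer explicitly; nothing allocates it for the caller.
buf = ctypes.create_unicode_buffer(elements)
buf.value = text

# Bytes occupied by the buffer, following the formula quoted above.
bytes_needed = wchar_size * elements

print(wchar_size, elements, bytes_needed)
```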



    printf("%d chars are copied to sample\n", size);
    wprintf(L"database value after unicode conversion is : %s\n",
            sample);
    return Py_BuildValue("");
 
 }
 
 
 static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
 unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};
 
 void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");
 
 }
 
 This prints the following when input value is given as test:
 4 chars are copied to sample
 database value after unicode conversion is : t
 
 Any ideas?
 
 -
 Abhigyan
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 23 2009)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2009-03-19: Released mxODBC.Connect 1.0.1  http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread abhi
On Mar 23, 3:04 pm, M.-A. Lemburg m...@egenix.com wrote:
 [snip]

Thanks Marc, John,
 With your help, I am at least somewhere. I re-wrote the code
to compare Py_Unicode and wchar_t outputs and they both look exactly
the same.

#include <Python.h>

static PyObject *unicode_helper(PyObject *self,PyObject *args){
    const char *name;
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t * w=NULL;
    int size = 0;
    int i;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }


    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    printf("size of sampleObj is : %d\n",PyUnicode_GET_SIZE
    (sampleObj));
    w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof
    (wchar_t));
    size = PyUnicode_AsWideChar(sampleObj,w,(PyUnicode_GET_SIZE(sampleObj)

Re: Unicode problem in ucs4

2009-03-23 Thread M.-A. Lemburg
On 2009-03-23 11:50, abhi wrote:
 On Mar 23, 3:04 pm, M.-A. Lemburg m...@egenix.com wrote:
 Thanks Marc, John,
  With your help, I am at least somewhere. I re-wrote the code
 to compare Py_Unicode and wchar_t outputs and they both look exactly
 the same.
 
 #include <Python.h>
 
 static PyObject *unicode_helper(PyObject *self,PyObject *args){
    const char *name;
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t * w=NULL;
    int size = 0;
    int i;
 
    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
 
 
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    printf("size of sampleObj is : %d\n",PyUnicode_GET_SIZE
    (sampleObj));
    w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof
    (wchar_t));
    size = PyUnicode_AsWideChar(sampleObj,w,(PyUnicode_GET_SIZE(sampleObj)
    +1)*sizeof(wchar_t));
    printf("%d chars are copied to w\n",size);
    printf("size of wchar_t is : %d\n", sizeof(wchar_t));
    printf("size of Py_UNICODE is: %d\n",sizeof(Py_UNICODE));
    for(i=0;i<PyUnicode_GET_SIZE(sampleObj);i++){
        printf("sample is : %c\n",sample[i]);
        printf("w is : %c\n",w[i]);
    }
    return sampleObj;
 }
 
 static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
 unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};
 
 void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");
 }
 
 This gives the following output when I pass "abc" as input:
 
 size of sampleObj is : 3
 3 chars are copied to w
 size of wchar_t is : 4
 size of Py_UNICODE is: 4
 sample is : a
 w is : a
 sample is : b
 w is : b
 sample is : c
 w is : c
 
 So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
 \0s after a char, printf or wprintf is only printing one letter.
 I need to further process the data and those libraries will need the
 data in UCS2 format (2 bytes), otherwise they fail. Is there any way
 by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
 data to UCS2 explicitly?

Sure: just use the appropriate UTF-16 codec for this.

/* Generic codec based encoding API.

   object is passed through the encoder function found for the given
   encoding using the error handling method defined by errors. errors
   may be NULL to use the default method defined for the codec.

   Raises a LookupError in case no encoder can be found.

 */

PyAPI_FUNC(PyObject *) PyCodec_Encode(
   PyObject *object,
   const char *encoding,
   const char *errors
   );

encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
for big endian.
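
What those two codec names produce can be seen directly from Python.
The sketch below is for Python 3 and uses the str.encode method rather
than the C-level PyCodec_Encode call (an assumption that both resolve
the same codec names); the sample strings are made up.

```python
text = 'abc'

# Two bytes per character, low byte first.
little = text.encode('utf-16-le')

# Two bytes per character, high byte first.
big = text.encode('utf-16-be')

# The plain 'utf-16' codec prepends a byte order mark; the -le/-be ones do not.
with_bom = text.encode('utf-16')

# UTF-16 matches fixed-width UCS-2 only inside the Basic Multilingual Plane;
# a character outside it becomes a surrogate pair (four bytes).
outside_bmp = '\U0001F600'.encode('utf-16-le')

print(repr(little), repr(big), len(with_bom), len(outside_bmp))
```

If the consuming library really expects strict UCS-2, input outside the
Basic Multilingual Plane would need separate handling.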

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 23 2009)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2009-03-19: Released mxODBC.Connect 1.0.1  http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread abhi
On Mar 23, 4:37 pm, M.-A. Lemburg m...@egenix.com wrote:
 [snip]

Thanks, but this is returning PyObject *, whereas I need value in some
variable which can be printed using wprintf() like wchar_t (having a
size of 2 bytes). If I again convert this PyObject to wchar_t or
PyUnicode, I go back to where I started. :)

-
Abhigyan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread abhi
On Mar 23, 4:57 pm, abhi abhigyan_agra...@in.ibm.com wrote:
 [snip]

 Thanks, but this is returning PyObject *, whereas I need value in some
 variable which can be printed using wprintf() like wchar_t (having a
 size of 2 bytes). If I again convert this PyObject to wchar_t or
 PyUnicode, I go back to where I started. :)

 -
 Abhigyan

Hi Marc,
   Is there any way to ensure that wchar_t size would always be 2
instead of 4 in ucs4 configured python? Googling gave me the
impression that there is some logic written in PyUnicode_AsWideChar()
which can take care of ucs4 to ucs2 conversion if sizes of Py_UNICODE
and wchar_t are different.

-
Abhigyan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode problem in ucs4

2009-03-23 Thread M.-A. Lemburg
wchar_t is defined by your compiler. There's no way to change that.

However, you can configure Python to use UCS2 (default) or UCS4 (used
on most Unix platforms), so it's easy to customize for your needs.
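
For anyone who wants to check what their own platform does, here is a minimal Python 3 sketch using the standard ctypes module. It assumes Python 3.3 or later, where the UCS2/UCS4 build option discussed here no longer exists and sys.maxunicode is always 0x10FFFF; the name wchar_width is only illustrative.

```python
import ctypes
import sys

# Width in bytes of the C compiler's wchar_t on this platform
# (typically 2 on Windows and 4 on most Unix systems).
wchar_width = ctypes.sizeof(ctypes.c_wchar)

print("sizeof(wchar_t) =", wchar_width)
print("sys.maxunicode  =", hex(sys.maxunicode))
```

The width is fixed by the compiler and platform ABI, which is why it can only be inspected, not configured.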

-- 
Marc-Andre Lemburg
eGenix.com



Re: Unicode problem in ucs4

2009-03-23 Thread M.-A. Lemburg
On 2009-03-23 12:57, abhi wrote:
 Is there any way
 by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
 data to UCS2 explicitly?
 Sure: just use the appropriate UTF-16 codec for this.

 /* Generic codec based encoding API.

object is passed through the encoder function found for the given
encoding using the error handling method defined by errors. errors
may be NULL to use the default method defined for the codec.

Raises a LookupError in case no encoder can be found.

  */

 PyAPI_FUNC(PyObject *) PyCodec_Encode(
PyObject *object,
const char *encoding,
const char *errors
);

 encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
 for big endian.
 
 Thanks, but this is returning PyObject *, whereas I need value in some
 variable which can be printed using wprintf() like wchar_t (having a
 size of 2 bytes). If I again convert this PyObject to wchar_t or
 PyUnicode, I go back to where I started. :)

It will return a PyString object with the UTF-16 data. You can
use PyString_AS_STRING() to access the data stored by it.

Note that writing your own UCS2/UCS4 converter isn't all that hard
either. Just have a look at the code in unicodeobject.c for
PyUnicode_AsWideChar().
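
To illustrate what the UTF-16 codec produces, here is a small sketch at the Python level rather than in C. It assumes Python 3, and the helper name to_utf16_units is hypothetical. It shows that characters outside the Basic Multilingual Plane become two 16-bit units (a surrogate pair), which any library expecting strict UCS2 would have to tolerate.

```python
import struct

def to_utf16_units(text, byteorder="little"):
    """Encode text with the UTF-16 codec and return its 16-bit code units."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    data = text.encode(codec)            # bytes, 2 per code unit, no BOM
    fmt = ("<" if byteorder == "little" else ">") + "%dH" % (len(data) // 2)
    return list(struct.unpack(fmt, data))

print(to_utf16_units("abc"))             # three units, one per character
print(to_utf16_units("\U0001F600"))      # one character, two units
```

Choosing the explicit -le or -be codec variant avoids the byte order mark that the plain 'utf-16' codec would prepend.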

-- 
Marc-Andre Lemburg
eGenix.com



Re: Unicode problem in ucs4

2009-03-23 Thread Martin v. Löwis
 So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
 \0s after a char, printf or wprintf is only printing one letter.

No. printf indeed will see a terminating character. However, wprintf
should correctly know that a wchar_t has four bytes per character,
and print it correctly. Make sure to use %ls to print wchar_t arrays;
%s would print multi-byte character strings.
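
The effect can be reproduced without any C code. The sketch below assumes Python 3 and a little-endian layout (hence the utf-32-le codec); it shows the bytes a 4-byte-per-character buffer holds for "abc" and what a byte-oriented routine sees before it hits the first zero byte.

```python
text = "abc"
ucs4 = text.encode("utf-32-le")   # 4 bytes per character, as with a 4-byte wchar_t

# A byte-oriented C string routine such as printf("%s") stops at the first
# zero byte, so it only ever sees the first character.
seen_by_byte_printf = ucs4.split(b"\x00", 1)[0]

print(ucs4)                  # b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'
print(seen_by_byte_printf)   # b'a'
```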

 I need to further process the data and those libraries will need the
 data in UCS2 format (2 bytes), otherwise they fail.

Are you absolutely sure about that? Why does that library expect
UCS-2, when your system's wchar_t is four bytes?

In any case, do what MAL told you: use the UTF-16 codec ('utf-16-le'
or 'utf-16-be') to convert the Unicode string to a 2-bytes-per-char
byte string. The PyObject you get from the conversion is a byte string
object; use PyString_AsStringAndSize to get to the actual bytes.

Regards,
Martin


Re: Unicode problem in ucs4

2009-03-20 Thread Martin v. Löwis
 Any idea on why this is happening? 

Can you provide a complete example? Your code looks correct, and should
just work.

How do you know the result contains only 't' (i.e. how do you know it
does not contain 'e', 's', 't')?

Regards,
Martin


Re: Unicode problem in ucs4

2009-03-20 Thread abhi
Hi Martin,
Here is the code:

unicodeTest.c

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");
}

static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)unicode_helper,
                             METH_VARARGS,"test ucs2, ucs4"},{NULL}};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest",funcs,"");
}

When I install this unicodeTest on a UCS2 Python, wprintf prints whatever
is passed, e.g.

>>> import unicodeTest
>>> unicodeTest.unicodeTest("hello world")
database value after unicode conversion is : hello world

but it prints the following on a UCS4-configured Python:
database value after unicode conversion is : h
Regards,
Abhigyan


Re: Unicode problem in ucs4

2009-03-20 Thread M.-A. Lemburg
On 2009-03-20 12:13, abhi wrote:
   // Explicitly convert it to unicode and get Py_UNICODE value
   sampleObj = PyUnicode_FromObject(sampleObj);
   sample = PyUnicode_AS_UNICODE(sampleObj);
   wprintf(L"database value after unicode conversion is : %s\n", sample);

You have to use PyUnicode_AsWideChar() to convert a Python
Unicode object to a wchar_t representation.

Please don't make any assumptions about what Py_UNICODE maps
to, and always use the Unicode API for this. It is designed
to provide a portable interface and will not do more conversion
work than necessary.

-- 
Marc-Andre Lemburg
eGenix.com



Unicode problem in ucs4

2009-03-19 Thread abhi
Hi,
I have a C extension which takes a unicode or string value from
Python and converts it to unicode before doing more operations on it.
The skeleton looks like:

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);

    // perform other operations.
    ...
}

This piece of code works fine on Python with the UCS2 configuration
but fails with the UCS4 configuration. By failing, I mean that only the
first letter ends up in the variable sample, i.e. if I pass "test" from
Python then sample will contain only "t". However,
PyUnicode_GetSize(sampleObj) returns the correct value (4 in this case).

Any idea on why this is happening? Any help will be appreciated.

Regards,
Abhigyan


Unicode Problem

2008-10-30 Thread Seid Mohammed
I am new to Python.
I want to print Amharic characters using the Python IDLE.
Here is some sample code:
==
>>> abebe = 'አበበ በሶ በላ'
>>> abebe
'\xe1\x8a\xa0\xe1\x89\xa0\xe1\x89\xa0 \xe1\x89\xa0\xe1\x88\xb6 \xe1\x89\xa0\xe1\x88\x8b'
>>> print abebe
አበበ በሶ በላ
>>> abeba = ['አበበ','በሶ','በላ']
>>> abeba
['\xe1\x8a\xa0\xe1\x89\xa0\xe1\x89\xa0', '\xe1\x89\xa0\xe1\x88\xb6',
'\xe1\x89\xa0\xe1\x88\x8b']
>>> print abeba
['\xe1\x8a\xa0\xe1\x89\xa0\xe1\x89\xa0', '\xe1\x89\xa0\xe1\x88\xb6',
'\xe1\x89\xa0\xe1\x88\x8b']
>>> len(abebe)
23

So my questions are:
1) Why does ">>> abebe" print '\xe1\x8a\xa0\xe1\x89\xa0\xe1\x89\xa0
\xe1\x89\xa0\xe1\x88\xb6 \xe1\x89\xa0\xe1\x88\x8b' instead of አበበ በሶ በላ?
2) Why doesn't ">>> print abeba" print the expected አበበ በሶ በላ string?
Thanks a lot.
Seid M.


Re: Unicode Problem

2008-10-30 Thread Marc 'BlackJack' Rintsch
Because lists represent their content in the `repr()` form.  So you, the 
programmer, can see what's really in there.
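
A small sketch of that rule, assuming Python 3 and a hypothetical class Word that has deliberately different str() and repr() forms:

```python
class Word:
    """Hypothetical class with distinct str() and repr() forms."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        return "Word(%r)" % self.text


w = Word("hi")
single = str(w)       # uses __str__
in_list = str([w])    # the list formats each element with repr()

print(single)
print(in_list)
```

To display the elements in their str() form, join or print them one by one instead of printing the list itself.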

Ciao,
Marc 'BlackJack' Rintsch


Re: Unicode Problem

2008-10-30 Thread Ulrich Eckhardt
Seid Mohammed wrote:
 I am new to python.

Welcome! :)

When you just type an identifier X on the commandline, Python outputs the
result of calling repr(X). This typically gives you something that you
could enter in any Python program. Note that e.g. the string 'አበበ በሶ በላ' is
not suitable in any Python program, it requires an encoding where those
characters are supported like e.g. UTF-8.

Now, if you type print X on the commandline, it will output the thing as a
string instead, giving you the original contents. If, like for a list, no
string representation exists, it will fall back to using repr() instead.
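
The numbers in the session can be checked with a short sketch. It assumes Python 3, where a str holds characters and the UTF-8 bytes that a Python 2 str literal held have to be produced explicitly with encode(); the variable names are only illustrative.

```python
words = ["አበበ", "በሶ", "በላ"]
sentence = " ".join(words)

utf8 = sentence.encode("utf-8")   # the bytes a Python 2 str literal held

print(len(sentence))              # number of characters
print(len(utf8))                  # number of UTF-8 bytes
print(ascii(sentence))            # escaped, repr-like form
print(repr(utf8))                 # escaped bytes, as in the session above
```

Each Amharic character takes three bytes in UTF-8 and each space one, which accounts for the difference between the two lengths.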


Disclaimer: I'm not a pro yet myself, but I think this covers the background
a bit. Maybe someone will correct me if I'm horribly wrong.

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode Problem

2008-10-30 Thread Bard Aase
When you display a string in the interpreter by typing ">>> abeba", it
escapes any non-ASCII characters.
If you instead print it using ">>> print abeba", it prints the
proper characters, as long as your terminal supports them.


-- 
mvh base (Bård Aase)


Re: Logging library unicode problem

2008-08-20 Thread Vinay Sajip
Hi Victor,

Can you try modifying your patch to use the following logic instead of
the print statement?

    if hasattr(self.stream, 'encoding'):
        self.stream.write(fs % msg.encode(self.stream.encoding))
    else:
        self.stream.write(fs % msg)

Does this work in your scenario?
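
The selection logic can be tried outside the logging module. The following is a sketch only, assuming Python 3 (where text streams normally encode for themselves); encode_for_stream is a hypothetical helper, not part of logging. It also covers a stream whose encoding attribute is missing or None, and replaces unencodable characters instead of raising.

```python
import io

def encode_for_stream(msg, stream, fallback="utf-8"):
    """Encode msg with the stream's declared encoding, else a fallback.

    Characters the target encoding cannot represent are replaced rather
    than raising UnicodeEncodeError.
    """
    encoding = getattr(stream, "encoding", None) or fallback
    return msg.encode(encoding, errors="replace")

ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
print(encode_for_stream("caf\u00e9", ascii_stream))   # b'caf?'
print(encode_for_stream("caf\u00e9", io.BytesIO()))   # b'caf\xc3\xa9'
```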

Regards,


Vinay Sajip

Logging library unicode problem

2008-08-13 Thread Victor Lin
Hi,
I'm writing an application using Python's standard logging system. I
encountered a problem with unicode messages passed to the logging library:
the unicode message gets messed up by the logging handler.

A piece of StreamHandler:

    try:
        self.stream.write(fs % msg)
    except UnicodeError:
        self.stream.write(fs % msg.encode("UTF-8"))

It just writes the message to the stream. If there is a unicode error,
it rewrites msg with UTF-8 encoding.

I wrote some code to try this:

    import sys
    print u'中文字測試'
    print sys.stdout.encoding
    sys.stdout.write(u'中文')

Result of that program:

    中文字測試
    cp950
    Traceback (most recent call last):
      File "update_stockprice.py", line 92, in <module>
        sys.stdout.write(u'銝剜?')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position
    0-1: ordinal not in range(128)

It shows that

1. The print statement encodes what it gets with stream.encoding?
2. stream.write doesn't do any encoding, it just writes the data
(because it might be binary data?)
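
The two conversions can be compared directly. This sketch assumes Python 3 and a CPython build that ships the cp950 codec; it encodes the same two characters once with the console code page reported by the program and once with ASCII, the codec named in the traceback.

```python
text = "\u4e2d\u6587"   # the two characters passed to sys.stdout.write()

as_cp950 = text.encode("cp950")      # succeeds: the code page covers them
try:
    text.encode("ascii")             # the implicit conversion that failed
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

print(as_cp950.decode("cp950") == text)
```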

So the problem is: the StreamHandler of the standard logging library uses
stream.write to log the message; if there is a unicode error, the unicode
string is encoded to UTF-8. This behavior messes my unicode up.

Here I modify the code of StreamHandler:

    try:
        print >> self.stream, msg
        #self.stream.write(fs % msg)
    except UnicodeError:
        self.stream.write(fs % msg.encode("UTF-8"))

I replaced stream.write with a print statement, so that it will try to
use stream.encoding to encode msg. Now everything works fine.

My questions are:
Could the behavior of StreamHandler be considered a bug?
If it is, how do I report this bug?
Is my solution correct?
Are there any side effects caused by doing so?
If the code I wrote is fine and solves the problem, how do I report it
to the Python project?
I think this could be helpful for people who have also encountered this
problem.

Thanks.
Victor Lin.

Re: Logging library unicode problem

2008-08-13 Thread Patrol Sun
What's your system? Simple Chinese Windows???


Unicode Problem

2008-01-28 Thread Victor Subervi
Hi;
New to unicode. Got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 29, in tagWords
  File "/usr/local/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9:
ordinal not in range(128)

I think the problem comes from this code snippet:


for line in sentences:
    print line
    tup = re.split(' ', line)
    for word in tup:
        for key, value in dictionary.items():
            if key == word:
                word = word + '::' + value
        newLine.append(word)
sentences.close()

TIA,

Victor

[issue1040] Unicode problem with TZ

2007-08-30 Thread Martin v. Löwis

Martin v. Löwis added the comment:

This is now fixed in r57720.

Using wide APIs would be possible through GetTimeZoneInformation,
however, then TZ won't be supported anymore (unless the CRT code to
parse TZ is duplicated).

--
nosy: +loewis
resolution:  -> fixed
status: open -> closed

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1040
__
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1040] Unicode problem with TZ

2007-08-29 Thread Martin v. Löwis

Changes by Martin v. Löwis:


--
assignee:  -> loewis




[issue1040] Unicode problem with TZ

2007-08-29 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc added the comment:

I have a patch for this, which uses MBCS conversion instead of relying
on the default utf-8 (here and several other places). Tested on a French
version of winXP.

Which leads me to the question: should Windows use MBCS encoding by
default when converting between char* and PyUnicode, and not utf-8?
There are some other tracker items which would benefit from this.

After all, C strings can only come from 1) Python code, 2) system I/O
and messages, and 3) constants in source code.
IMO, 1) can use the representation it prefers, 2) would clearly lead to
fewer errors if handled as MBCS, and 3) only uses 7-bit ASCII.
There is very little need for utf-8 here.

--
nosy: +amaury.forgeotdarc




[issue1040] Unicode problem with TZ

2007-08-29 Thread Thomas Heller

Thomas Heller added the comment:

IMO the very best would be to avoid as many conversions as possible by
using the wide apis on Windows.  Not for _tzname maybe, but for env
vars, sys.argv, sys.path, and so on.  Not that I would have time to work
on that...




[issue1040] Unicode problem with TZ

2007-08-28 Thread Thomas Heller

Thomas Heller added the comment:

BTW, setting the environment variable TZ to, say, 'GMT' makes the
problem go away.




[issue1040] Unicode problem with TZ

2007-08-28 Thread Thomas Heller

New submission from Thomas Heller:

In my german version of winXP SP2, python3 cannot import the time module:

c:\svn\py3k\PCbuild>python_d
Python 3.0x (py3k:57600M, Aug 28 2007, 07:58:23) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11:
invalid data
[36719 refs]
>>> ^Z

The problem is that the libc '_tzname' variable contains umlauts.  For
comparison, here is what Python2.5 does:

c:\svn\py3k\PCbuild>\python25\python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> time.tzname
('Westeurop\xe4ische Normalzeit', 'Westeurop\xe4ische Normalzeit')
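
The failure can be reproduced from the byte string alone. This is a sketch, assuming a current Python 3 and assuming that the bytes are in code page 1252, the usual ANSI code page on a German Windows installation; the report itself does not name the code page.

```python
raw = b"Westeurop\xe4ische Normalzeit"   # bytes shown by Python 2.5

try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start           # index of the offending byte

print(error_position)
print(ascii(raw.decode("cp1252")))       # decodes with the assumed code page
```

The byte 0xe4 at index 9 starts what UTF-8 would treat as a three-byte sequence, which matches the "position 9-11" in the traceback.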


--
components: Windows
messages: 55351
nosy: theller
severity: normal
status: open
title: Unicode problem with TZ
versions: Python 3.0




Re: Parsing XML with ElementTree (unicode problem?)

2007-07-26 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 yup, but why does this happen - on the script side - I write the exact
 same strings, of content with supposedly, same encoding, so why the
 encoding is different?

Read the mail. It's not the encoding, it's the <p> which does not get
through as a tag in the first file.
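
The difference between the two ways of writing the file can be sketched with the standard library's ElementTree. The sample text and variable names are hypothetical, and Python 3 is assumed. Text set through the API is escaped on output, whereas text pasted between hand-written tags is not.

```python
import xml.etree.ElementTree as ET

content = "Modern Church.  <p>The book is a thrill."   # hypothetical sample text

# Built through the API: markup characters in text are escaped on output.
elem = ET.Element("CONTENT")
elem.text = content
built = ET.tostring(elem, encoding="unicode")

# Built by string concatenation: the bare <p> becomes an unclosed tag.
manual = "<CONTENT>" + content + "</CONTENT>"

print(built)
try:
    ET.fromstring(manual)
    manual_parses = True
except ET.ParseError as exc:
    manual_parses = False
    print("manual file fails:", exc)
```

When a file really has to be assembled by hand, xml.sax.saxutils.escape() applies the same escaping to the text before it is concatenated.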

Stefan


Re: Parsing XML with ElementTree (unicode problem?)

2007-07-26 Thread John Machin
Simple file comparison:

File 1: ... Modern Church.  &lt;p&gt;The book ...
File 2: ... Modern Church.  <p>The book ...

Firefox:

XML Parsing Error: mismatched tag. Expected: </p>.
Location: file:///C:/junk/myDeVinciCode166_2.xml
Line Number 3, Column 1153:

<CONTENT>The...Church.  <p>The...thrill.</CONTENT>
--^




Re: Parsing XML with ElementTree (unicode problem?)

2007-07-26 Thread oren . tsur
yup, but why does this happen - on the script side - I write the exact
same strings, of content with supposedly, same encoding, so why the
encoding is different?



Re: Parsing XML with ElementTree (unicode problem?)

2007-07-26 Thread oren . tsur
OK, I solved the problem but I still don't get what went wrong.
Solution - use tree builder in order to create the new xml file
(previously I was  manually creating it).

I'm still curious so I'm adding a link to a short and very simple
script that gets an xml (containing non ascii chars) from the web and
saves some of the elements to 2 different local xml files - one is
created by XMLWriter and the other is created manually. you could see
that parsing of the first local file is OK while parsing of the
manually created xml file fails. obviously I'm doing something wrong
and I'd love to learn what.

the toy script:
http://staff.science.uva.nl/~otsur/code/xmlConversions.py

Thanks for all your help,

Oren



Re: Parsing XML with ElementTree (unicode problem?)

2007-07-26 Thread oren . tsur
On Jul 26, 4:34 pm, Stefan Behnel [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
  On Jul 26, 3:13 pm, John Machin [EMAIL PROTECTED] wrote:
  On Jul 26, 9:24 pm, [EMAIL PROTECTED] wrote:

  OK, I solved the problem but I still don't get what went wrong.
  Solution - use tree builder in order to create the new xml file
  (previously I was  manually creating it).
  I'm still curious so I'm adding a link to a short and very simple
  script that gets an xml (containing non ascii chars) from the web and
  saves some of the elements to 2 different local xml files - one is
  created by XMLWriter and the other is created manually. you could see
  that parsing of the first local file is OK while parsing of the
  manually created xml file fails. obviously I'm doing something wrong
  and I'd love to learn what.
  the toy script:http://staff.science.uva.nl/~otsur/code/xmlConversions.py
  Simple file comparison:

  File 1: ... Modern Church.  &lt;p&gt;The book ...
  File 2: ... Modern Church.  <p>The book ...

  Firefox:

  XML Parsing Error: mismatched tag. Expected: </p>.
  Location: file:///C:/junk/myDeVinciCode166_2.xml
  Line Number 3, Column 1153:

  <CONTENT>The...Church.  <p>The...thrill.</CONTENT>
  --^

  Yup, but why does this happen? On the script side I write the exact
  same content strings, supposedly with the same encoding, so why is the
  encoding different?

 Read the mail. It's not the encoding, it's the <p> which does not get
 through as a tag in the first file.

 Stefan

thanks. I guess it was a dumb question after all. thanks again :)
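
To make Stefan's point concrete, here is a small sketch of the difference between the two files. It is an illustration only, not the toy script: the review text is invented filler around the fragments quoted above, and escape() from xml.sax.saxutils is my choice of helper for the hand-written case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented sample text containing a literal "<p>", as in the Amazon review.
review = "Modern Church.  <p>The book is a thrill."

# Written by hand without escaping: "<p>" becomes an opening tag that is
# never closed, so the parser complains about a mismatched tag.
by_hand = "<CONTENT>" + review + "</CONTENT>"
try:
    ET.fromstring(by_hand)
    by_hand_ok = True
except ET.ParseError as err:
    by_hand_ok = False
    by_hand_error = str(err)

# Written by hand with escaping: "<p>" is stored as "&lt;p&gt;" and comes
# back as plain text when the file is parsed again.
escaped = "<CONTENT>" + escape(review) + "</CONTENT>"
round_trip = ET.fromstring(escaped).text
```

So the strings written by the script were the same, but XMLWriter escaped the markup characters on the way out and the manual version did not.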



Re: Parsing XML with ElementTree (unicode problem?)

2007-07-24 Thread oren . tsur
On Jul 23, 4:46 pm, Richard Brodie [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote in message

 news:[EMAIL PROTECTED]

  so what's the difference? how comes parsing is fine
  in the first case but erroneous in the second case?

 You may have guessed the encoding wrong. It probably
 wasn't utf-8 to start with but iso8859-1 or similar.
 What actual byte value is in the file?

I tried it with different encodings and it didn't work. Anyways, I
would expect it to be utf-8 since the XML response to the amazon query
indicates a utf-8 (check it with
http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166

in your browser, the first line in the source is <?xml version="1.0"
encoding="UTF-8"?>)

but the thing is that the parser parses it all right from the web (the
amazon response) but fails to parse the locally saved file.

  2. there is another problem that might be similar I get a similar
  error if the content of the (locally saved) xml have special
  characters such as '&'

 Either the originator of the XML has messed up, or whatever
 you have done to save a local copy has mangled it.

I think I made a mess. I changed the '&' in the original response to
'and' because the parser failed to parse the '&' (in the locally saved
file) just like it failed with the French characters. Again, parsing
the original response was just fine.
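
A quick sketch of why a bare '&' in a hand-saved file trips the parser while the original response does not: in well-formed XML the ampersand has to be written as the entity &amp;amp;, which the parser turns back into '&'. The Title element and its text are invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A bare ampersand starts an entity reference that never completes.
raw = "<Title>Angels & Demons</Title>"

# The same text with the ampersand written as an entity.
fixed = "<Title>Angels &amp; Demons</Title>"

raw_ok = parses(raw)
fixed_ok = parses(fixed)
fixed_text = ET.fromstring(fixed).text
```

Replacing '&' with 'and' hides the symptom, but writing the entity (or letting ElementTree serialize the text) keeps the original content intact.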

Thanks again,

Oren




Re: Parsing XML with ElementTree (unicode problem?)

2007-07-24 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 On Jul 23, 4:46 pm, Richard Brodie [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote in message

 news:[EMAIL PROTECTED]

 so what's the difference? how comes parsing is fine
 in the first case but erroneous in the second case?
 You may have guessed the encoding wrong. It probably
 wasn't utf-8 to start with but iso8859-1 or similar.
 What actual byte value is in the file?
 
 I tried it with different encodings and it didn't work. Anyways, I
 would expect it to be utf-8 since the XML response to the amazon query
 indicates a utf-8 (check it with
 http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166
 
 in your browser, the first line in the source is <?xml version="1.0"
 encoding="UTF-8"?>)
 
 but the thing is that the parser parses it all right from the web (the
 amazon response) but fails to parse the locally saved file.

Then how did you save it to a file? Using your browser? Maybe that messed it
up? Or did you edit it with an Editor that doesn't understand UTF-8?
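
To see what an editor or a careless save can do, here is a sketch of the byte-level difference. It assumes the file was re-saved as ISO-8859-1 while still declaring UTF-8, which is only one possible way the local copy could have been damaged. The Name element is made up.

```python
import xml.etree.ElementTree as ET

name = "Sauni\u00e8re"  # "Saunière", written with an escape for clarity

utf8_bytes = name.encode("utf-8")         # b'Sauni\xc3\xa8re'
latin1_bytes = name.encode("iso-8859-1")  # b'Sauni\xe8re'

template = '<?xml version="1.0" encoding="UTF-8"?><Name>%s</Name>'

# Bytes that match the declared encoding parse fine.
good_doc = (template % name).encode("utf-8")
good_text = ET.fromstring(good_doc).text

# Bytes saved as ISO-8859-1 but still declared as UTF-8 do not.
bad_doc = (template % name).encode("iso-8859-1")
try:
    ET.fromstring(bad_doc)
    bad_ok = True
except ET.ParseError as err:
    bad_ok = False
    bad_error = str(err)  # "not well-formed (invalid token): ..."
```

Opening the saved file in binary mode and looking at the bytes around the reported line and column shows which of the two forms is actually on disk.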

If you want to extract the interesting stuff programmatically, you can use
lxml.etree. It's ElementTree compatible, but it can parse right from HTTP URLs
and it supports XPath for selecting stuff.

http://codespeak.net/lxml/

Stefan


Re: Parsing XML with ElementTree (unicode problem?)

2007-07-24 Thread Marc 'BlackJack' Rintsch
On Tue, 24 Jul 2007 05:57:26 +, oren.tsur wrote:

 but the thing is that the parser parses it all right from the web (the
 amazon response) but fails to parse the locally saved file.

I've just used wget to fetch that URL and `ElementTree` parses that local
file without problems.

Maybe you should stop searching the explanation within Python or
`ElementTree` and accept having a broken XML file on your disk.  :-)

Have you checked the local XML file with something like `xmllint` or
another XML parser already?
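
If `xmllint` is not at hand, a rough stand-in can be written with the standard library. This is only a sketch: first_xml_error is a hypothetical helper name, and it checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def first_xml_error(data):
    """Return None if data is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Small in-memory samples; for a real check, pass open(path, "rb").read().
ok_result = first_xml_error(b"<a><b>ok</b></a>")
bad_result = first_xml_error(b"<a>\n<b>oops</a>")
```

The reported line and column point at the place in the local file worth inspecting.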

Ciao,
Marc 'BlackJack' Rintsch


Re: Parsing XML with ElementTree (unicode problem?)

2007-07-24 Thread Steve Holden
Marc 'BlackJack' Rintsch wrote:
 On Tue, 24 Jul 2007 05:57:26 +, oren.tsur wrote:
 
 but the thing is that the parser parses it all right from the web (the
 amazon response) but fails to parse the locally saved file.
 
 I've just used wget to fetch that URL and `ElementTree` parses that local
 file without problems.
 
 Maybe you should stop searching the explanation within Python or
 `ElementTree` and accept having a broken XML file on your disk.  :-)
 
 Have you checked the local XML file with something like `xmllint` or
 another XML parser already?
 
 Ciao,
   Marc 'BlackJack' Rintsch

You should also realise that your posting compromised the Access ID 
embedded in the URL. If that was live it might be a good idea to replace it.

regards
  Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd   http://www.holdenweb.com
Skype: holdenweb  http://del.icio.us/steve.holden
--- Asciimercial --
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
--- Thank You for Reading -



Re: Parsing XML with ElementTree (unicode problem?)

2007-07-24 Thread André
On Jul 23, 11:29 am, [EMAIL PROTECTED] wrote:
 (this question was also posted in the devshed python 
 forum:http://forums.devshed.com/python-programming-11/parsing-xml-with-elem...
 ).
 -

 (it's a bit longish but I hope I give all the information)

 1. here is my problem: I'm trying to parse an XML file (saved locally)
 using elementtree.parse but I get the following error:
 xml.parsers.expat.ExpatError: not well-formed (invalid token): line
 13, column 327
 apparently, the problem is caused by the token 'Saunière' due to the
 apostrophe.

 the thing is that I'm sure that python (ElementTree module and parse()
 function) can handle this type of encoding since I obtain my xml file
 from the web by opening it with:

 from elementtree import ElementTree
 from urllib import urlopen
 query = r'http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166'
 root = ElementTree.parse(urlopen(query))

How about trying
root = ElementTree.parse(urlopen(query), encoding ='utf-8')
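
A note for later readers: in the standard-library xml.etree.ElementTree, parse() takes a source and an optional parser, not an encoding keyword, so the line above would most likely raise a TypeError there. The closest working form I know of is to pass an XMLParser with an explicit encoding, sketched below. The Name element and the choice of ISO-8859-1 are assumptions for the example, and an in-memory buffer stands in for urlopen(query).

```python
import io
import xml.etree.ElementTree as ET

# ISO-8859-1 bytes with no XML declaration, so the parser assumes UTF-8.
data = "<Name>Sauni\u00e8re</Name>".encode("iso-8859-1")

# Default parser: the lone 0xE8 byte is not valid UTF-8.
try:
    ET.parse(io.BytesIO(data))
    default_ok = True
except ET.ParseError:
    default_ok = False

# Tell the parser which encoding the bytes are really in.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.parse(io.BytesIO(data), parser=parser).getroot()
```

This only helps when the bytes really are in the encoding named; it does not repair a file whose markup is broken.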

André


