subject:"translating foreign data"

Re: translating foreign data

2018-06-26 Thread Stefan Ram

  To: Richard Damon
From: r...@zedat.fu-berlin.de (Stefan Ram)

Richard Damon  writes:
>Now, if I have a parser that doesn't use the locale, but some other rule
>base, then I just need to provide it with the right rules, which is
>basically just defining the right locale.

  Here's an example C++ program I wrote. It uses the class s
  to provide rules for an ad hoc locale which then is used to
  imbue a temporary string stream which then can parse numbers
  using the thousands separator given by s.

  main.cpp

#include <iostream>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

using namespace ::std::literals;

struct s : ::std::numpunct< char >
{ char do_thousands_sep() const override { return ','; }
  ::std::string do_grouping() const override { return "\3"; }};

static double double_value_of( ::std::string const & string ) {
::std::stringstream source { string };
  source.imbue( ::std::locale( source.getloc(), new s ));
  double number; source >> number; return number; }

int main()
{ ::std::cout << double_value_of( "4,800.1"s )<< '\n';
  ::std::cout << double_value_of( "3,334.5e9"s )<< '\n'; }

  transcript

4800.1
3.3345e+012
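
  For comparison on the Python side, here is a minimal sketch of the same
  idea that never touches the process-wide locale: the separators are passed
  in explicitly and removed before float() is called. The name
  float_value_of and its parameters are assumptions made for illustration;
  unlike the imbued stream above, this sketch does not verify that the
  grouping is well-formed.

```python
def float_value_of(text, thousands_sep=",", decimal_point="."):
    """Parse text as a float using explicitly supplied separators.

    Hypothetical helper: it simply drops the thousands separator and
    normalises the decimal point; it does not validate digit grouping.
    """
    if thousands_sep:
        text = text.replace(thousands_sep, "")
    if decimal_point != ".":
        text = text.replace(decimal_point, ".")
    return float(text)


print(float_value_of("4,800.1"))
print(float_value_of("3,334.5e9"))
# The same routine with the separators swapped, as in many European files.
print(float_value_of("4.800,1", thousands_sep=".", decimal_point=","))
```

  The thousands separator is removed first so that a "." used for grouping
  is gone before "," is turned into the decimal point.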

-+- BBBS/Li6 v4.10 Toy-3
 + Origin: Prism bbs (1:261/38)

--- BBBS/Li6 v4.10 Toy-3
 * Origin: Prism bbs (1:261/38)
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-26 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :

> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>> Richard Damon :
>>
>>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>>> I always know my locale. The locale is tied to the human user.
>>> No, it should be tied to the data you are processing.
>>    In computing, a locale is a set of parameters that defines the user's
>>    language, region and any special variant preferences that the user
>>    wants to see in their user interface.
>>
>>    <https://en.wikipedia.org/wiki/Locale_(computer_software)>
>>
>> The data should not depend on the locale.
> So no one foreign ever gives you data?

Never in my decades in computer programming have I found any use for locales.

In particular, they have never helped me decode "foreign" data, whether in
ASCII, Latin-1, Latin-3, Latin-9, JIS or UTF-8.

> Note, that wikipedia article is focused on the SYSTEM locale, which
> yes, should reflect what the user wants in his interface.

I don't think locales have anything to do with anything else.


>>> If an English user is feeding a program Chinese documents, while
>>> processing those documents the program should be using the
>>> appropriate Chinese Locale.
>> Not true.
> How else is the program going to understand the Chinese data?

If someone gives me a file, they had better indicate the file format.

> The fact that locale issues leak into data is the reason that the
> single immutable global locale doesn't work.

Locales don't work. Period.

> You really want to imbue into data streams what locale their data
> represents (and use that in some of the later processing of data from
> that stream).

Can you refer to a standard for that kind of imbuement?

Of course, you have document types, schema definitions and other implicit and
explicit format indicators. You shouldn't call them locales, though.

>>> Data presented to the user should normally use his locale (unless he
>>> has specified something different).
>> Ok. Here's a value for you:
>>
>> 100€
>>
>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
> If I processed that on my system I would either get $100, or an error of
> wrong currency symbol depending on the error checking.

Don't forget to convert the amount as well...

>> The single global is due to what the locale was introduced for. It
>> came about around the time when Unix applications were being made
>> "8-bit clean." Along with UCS-2 and XML, it's one of those things you
>> wish you'd never have to deal with.
>
> Locale predates UCS-2, it was the early attempt to provide
> internationalization to C code so even programmers who didn't think
> about it could add the line setlocale(LC_ALL, "") and make their code
> work at least mostly right in more places. A single global was quick
> and simple, and since threads didn't exist, not an issue.
>
> In many ways it was the first attempt that should have been thrown
> away, but got too intertwined. C++ made a significant improvement to
> it by having streams remember their own locale.

No one should breathe any new life into locales.

And yes, add C++ to the list of things you wish you'd never have to deal
with...


Marko


Re: translating foreign data

2018-06-26 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :

> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>> I always know my locale. The locale is tied to the human user.
> No, it should be tied to the data you are processing.

   In computing, a locale is a set of parameters that defines the user's
   language, region and any special variant preferences that the user
   wants to see in their user interface.

   <https://en.wikipedia.org/wiki/Locale_(computer_software)>

The data should not depend on the locale.

> If an English user is feeding a program Chinese documents, while
> processing those documents the program should be using the appropriate
> Chinese Locale.

Not true.

> Again, no, a locale is tied to the data, not the user (unless you want
> to require the user to translate all data to his locale conventions
> (without using a program that can use locale information) before
> providing it to a program). Yes, the default for the interpretation
> should be the user's default/current locale, but you really want them
> to be able to say I got this file from someone whose locale was
> different than mine.

The locale is not directly related to data or data formats. Of course, locales
leak into data and create the sorry mess we are talking about.

> Data presented to the user should normally use his locale (unless he
> has specified something different).

Ok. Here's a value for you:

100€

I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?

>> BTW, I think the locale is a terrible invention.
>
> The locale is a lot better than the alternative, where every
> application that needs to deal with internationalization needs to
> recreate (and debug) all of the mechanism. I agree it isn't perfect,
> and for small simple programs it would be nice to be able to say "I
> don't want all this stuff, make it go away".

The locale doesn't solve a single problem in practice and often trips up
programs. For example, a customer-visible bug was once caused by:

   sort

> Python took its locale (at least initially) from C, which was a single
> global which does have more issues because of this.

The single global is due to what the locale was introduced for. It came about
around the time when Unix applications were being made "8-bit clean." Along
with UCS-2 and XML, it's one of those things you wish you'd never have to deal
with.

Marko


Re: translating foreign data

2018-06-26 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :
> If you know the Locale, then you do know what the decimal separator
> is, as that is part of what a locale defines.

I don't know what that sentence means.

> The issue is that if you just know the encoding, you don't necessarily
> know the locale.

I always know my locale. The locale is tied to the human user.

> He also commented that he didn't want to set the locale in the
> routine, as that sets it globally for the full application (but
> perhaps the latter could be fixed by first doing a
> locale.getlocale(), then setlocale for the file's locale, and then at
> the end of reading and processing restoring the old locale).

Setting a locale application-wise is

 * not in accordance with the idea of a locale (the locale should be
   constant within a user session)

 * not easily possible (the locale is seen by all threads
   simultaneously)


BTW, I think the locale is a terrible invention.


Marko


Re: translating foreign data

2018-06-26 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:

> If you know the Locale, then you do know what the decimal separator is,
> as that is part of what a locale defines.

A locale defines a set of common cultural conventions. It doesn't mandate the
actual conventions in use in any specific document.

If I'm in Australia, using the en-AU locale, nevertheless I can generate a file
using , as a decimal separator. Try and stop me :-)

But your point is taken -- I misread Ethan saying that he knew the locale and
it wasn't helping, when in fact he was reluctant to change the locale as that's
a process-wide global change.

> The issue is that if you just
> know the encoding, you don't necessarily know the locale. He also
> commented that he didn't want to set the locale in the routine, as that
> sets it globally for the full application (but perhaps the latter could
> be fixed by first doing a locale.getlocale(), then setlocale for the
> file's locale, and then at the end of reading and processing restoring
> the old locale).

Indeed.
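
A minimal sketch of the save/set/restore pattern described in the quoted text,
written as a context manager. The name temporary_locale is an assumption made
for illustration, and "C" is used only because it is available everywhere; a
real foreign file would need a locale name that is actually installed on the
system. The change remains process-wide while the block runs, so this sketch is
not safe if other threads format or parse numbers at the same time.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(category, name):
    """Switch one locale category for the duration of a with-block."""
    # Calling setlocale() without a locale argument only queries the setting.
    saved = locale.setlocale(category)
    locale.setlocale(category, name)
    try:
        yield
    finally:
        # Restore the previous setting even if parsing raised an exception.
        locale.setlocale(category, saved)


with temporary_locale(locale.LC_NUMERIC, "C"):
    value = locale.atof("1234.5")
print(value)
```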

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-26 Thread Richard Damon

From: Richard Damon 

On 6/22/18 11:21 PM, Steven D'Aprano wrote:
> On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:
>
>> Steven D'Aprano  writes:
>>
>>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>>
>>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>>> that are not ASCII digits?
>>>>> Good question.  I have no idea.
>>>> It's much more of an open question than I thought.
>>> Nah, Python already solves that for you:
>> My understanding was that the OP does not (reliably) know the encoding,
>> though that was a guess based on a turn of phrase.
> I took it the other way: that Ethan *does* know the encoding, and his
> problem is that knowing the encoding and/or locale is not enough to
> recognise whether to use a period or comma as the decimal separator.
>
> Which it isn't.
If you know the Locale, then you do know what the decimal separator is, as that
is part of what a locale defines. The issue is that if you just know the
encoding, you don't necessarily know the locale. He also commented that he
didn't want to set the locale in the routine, as that sets it globally for the
full application (but perhaps the latter could be fixed by first doing a
locale.getlocale(), then setlocale for the file's locale, and then at the end
of reading and processing restoring the old locale).
>
> If he doesn't know the encoding, he has bigger problems than just
> converting strings into floats. Without knowing the encoding, he cannot
> even reliably detect non-ASCII digits at all.
>
>
>> Another guess is that the OP does not have Unicode data.  The term "code
>> page" hints at an 8-bit encoding or at least a pre-Unicode one.
> Assuming he is using Python 3, or using Python 2 sensibly, once he has
> specified the encoding and read the data from the file, he has Unicode.
>
> Unicode is a superset of (ideally) all code pages. Once you have decoded
> the data using the appropriate code page, you have a Unicode string, and
> Python doesn't care where it came from.
>
> The point is, once Ethan can get the intended characters out of the file
> into Python, it doesn't matter what code page they came from. They're now
> full-fledged Unicode characters, and Python's float() and int() functions
> can easily deal with non-ASCII digits. So long as he has digits in the
> first place, float() and int() will deal with them correctly.
>
>

--
Richard Damon


Re: translating foreign data

2018-06-26 Thread Steven D'Aprano

From: Steven D'Aprano 

On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:

> Steven D'Aprano  writes:
>
>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>
>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>> that are not ASCII digits?
>>>>
>>>> Good question.  I have no idea.
>>>
>>> It's much more of an open question than I thought.
>>
>> Nah, Python already solves that for you:
>
> My understanding was that the OP does not (reliably) know the encoding,
> though that was a guess based on a turn of phrase.

I took it the other way: that Ethan *does* know the encoding, and his problem
is that knowing the encoding and/or locale is not enough to recognise whether
to use a period or comma as the decimal separator.

Which it isn't.

If he doesn't know the encoding, he has bigger problems than just converting
strings into floats. Without knowing the encoding, he cannot even reliably
detect non-ASCII digits at all.

> Another guess is that the OP does not have Unicode data.  The term "code
> page" hints at an 8-bit encoding or at least a pre-Unicode one.

Assuming he is using Python 3, or using Python 2 sensibly, once he has
specified the encoding and read the data from the file, he has Unicode.

Unicode is a superset of (ideally) all code pages. Once you have decoded the
data using the appropriate code page, you have a Unicode string, and Python
doesn't care where it came from.

The point is, once Ethan can get the intended characters out of the file into
Python, it doesn't matter what code page they came from. They're now
full-fledged Unicode characters, and Python's float() and int() functions can
easily deal with non-ASCII digits. So long as he has digits in the first place,
float() and int() will deal with them correctly.
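
A short sketch of that behaviour, using escape sequences so the example does
not depend on the encoding of this message. The sample strings are made up for
illustration. It also shows the limit of the claim: the digits are handled, but
a non-ASCII decimal separator is not.

```python
# ARABIC-INDIC DIGIT ONE, TWO, THREE
arabic_indic = "\u0661\u0662\u0663"
# DEVANAGARI DIGIT ONE, TWO, an ASCII ".", DEVANAGARI DIGIT FIVE
devanagari = "\u0967\u0968.\u096b"

print(int(arabic_indic))
print(float(devanagari))

# float() only understands "." as the decimal point, so a string using
# U+066B ARABIC DECIMAL SEPARATOR is rejected even though its digits are fine.
try:
    float("\u0661\u066b\u0665")
except ValueError as exc:
    print("rejected:", exc)
```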

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-26 Thread Ben Bacarisse

  To: Steven D'Aprano
From: Ben Bacarisse 

Steven D'Aprano  writes:

> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>
>>>> The code page remark is curious.  Will some "code pages" have digits
>>>> that are not ASCII digits?
>>>
>>> Good question.  I have no idea.
>>
>> It's much more of an open question than I thought.
>
> Nah, Python already solves that for you:

My understanding was that the OP does not (reliably) know the encoding, though
that was a guess based on a turn of phrase.

Another guess is that the OP does not have Unicode data.  The term "code page"
hints at an 8-bit encoding or at least a pre-Unicode one.

--
Ben.


Re: translating foreign data

2018-06-26 Thread Richard Damon

From: Richard Damon 

On 6/23/18 10:44 PM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 17:52:55 -0400, Richard Damon wrote:
>
>> If you have more than just a number representing a value in the locale
>> currency, you can't ask the locale how to present/accept it.
> You're the only one saying that it has to be handled by the locale.
>
>
Actually, it was part of the problem statement by Marko, since he said to use
LC_MONETARY, which is the part of the Locale machinery dealing with monetary
quantities (and can ONLY handle the currency defined by the Locale). What would
you think of providing a program in, say, Java, to a problem statement that
said to write a Python program?

I suppose he could have just meant use the number, which would be like asking
to interpret the value of 100 euros using math.pi

Or it could have been just a bad question like how heavy is blue. (Since by
definition a locale only knows how to handle a single type of currency,
assuming any value is of that type).

My answer was in part to point out the problem with the problem statement (and
people seem to want to jump on me for pointing out the strengths and weaknesses
of the locale system).

This also goes back to the very original question at the beginning of the
thread: the OP had a bunch of data with numbers using varying locale
conventions (he didn't use the words), with various decimal separators, and
some people asked about non-'arabic' numbers (0-9).

This also goes back to some of the comments about file formats. Most file
formats are designed to be 'Machine Read' (even if they use text formatting)
and as such do NOT use localization facilities, so when processing them you
want the I/O processing system to be in a non-localized mode (typically numbers
always use . as the decimal separator, and usually nothing as the thousands
separator). While the text format files might be opened in a text editor, the
file format doesn't cater to making things pretty for the user. Some programs
will create input/output/storage files where it is expected that the user WILL
open them, look at them and maybe even edit them. Numbers will use
the locale convention of currency and decimal/thousands separators. If you have
such a system, changing the locale rules for these files may cause
misinterpreting the values.

If you are bringing such files from a 'foreign' system, you need to be able to
indicate what locale to use when reading that file. This sounds very much like
the category of problem that the OP was dealing with. They apparently have a
large number of files, presumably organized in some consistent manner so that
the values in them make sense, but the numbers are written in different locale
conventions, and this was causing the simplistic processing to fail.

--
Richard Damon


Re: translating foreign data

2018-06-25 Thread Chris Angelico

From: Chris Angelico 

On Sun, Jun 24, 2018 at 1:23 PM, Steven D'Aprano
 wrote:
> On Sun, 24 Jun 2018 12:53:49 +1000, Chris Angelico wrote:
>
> [...]
>>> Okay, you want a bit-pattern. In hex:
>>>
>>> '0x313030e282ac'
> [...]
>
>> Hmm. Actually, I'm a bit confused.
>>
>> >>> hex("100€".encode())
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: 'bytes' object cannot be interpreted as an integer
>>
>> Nope, that's not it. Needs something to turn the bytes into an integer
>> first. But I can't find a way to do that. Best I can find is:
>>
>> >>> "100€".encode().hex()
>> '313030e282ac'
>
> Dammit, that was what I was looking for, but I only looked on *strings*,
> not bytes.
>
>
>> No "0x" prefix, no function call. So, I'm stuck. How did you create your
>> one?
>
> py> hex(int.from_bytes("100€".encode("utf-8"), 'big'))
> '0x313030e282ac'

Ahhh thanks, that's the part I couldn't find (and didn't remember).

Anyhow, encoding to UTF-8 and then to bytes is pretty easy.

ChrisA


Re: translating foreign data

2018-06-25 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sun, 24 Jun 2018 12:53:49 +1000, Chris Angelico wrote:

[...]
>> Okay, you want a bit-pattern. In hex:
>>
>> '0x313030e282ac'
[...]

> Hmm. Actually, I'm a bit confused.
>
> >>> hex("100€".encode())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: 'bytes' object cannot be interpreted as an integer
>
> Nope, that's not it. Needs something to turn the bytes into an integer
> first. But I can't find a way to do that. Best I can find is:
>
> >>> "100€".encode().hex()
> '313030e282ac'

Dammit, that was what I was looking for, but I only looked on *strings*, not
bytes.


> No "0x" prefix, no function call. So, I'm stuck. How did you create your
> one?

py> hex(int.from_bytes("100€".encode("utf-8"), 'big'))
'0x313030e282ac'
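
Spelled out step by step, as a small sketch: both hex forms come from the same
six UTF-8 bytes, and the bytes decode straight back to the original text. The
variable names are only for illustration.

```python
text = "100\u20ac"                 # "100" followed by U+20AC EURO SIGN
raw = text.encode("utf-8")         # six bytes: 31 30 30 e2 82 ac

as_hex_string = raw.hex()                      # '313030e282ac'
as_int_hex = hex(int.from_bytes(raw, "big"))   # '0x313030e282ac'
print(as_hex_string)
print(as_int_hex)

# Going the other way recovers the text, euro sign included.
restored = bytes.fromhex(as_hex_string).decode("utf-8")
print(restored == text)
```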



--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-25 Thread Chris Angelico

From: Chris Angelico 

On Sun, Jun 24, 2018 at 12:44 PM, Steven D'Aprano
 wrote:
> You're joking, right? You can't possibly be so ignorant as to actually
> believe that. You have, right in front of you, a news post or email
> containing the text string "100€", and yet you are writing apparently in
> full seriousness that it is impossible to get that text string in a file.
>
> Okay, you want a bit-pattern. In hex:
>
> '0x313030e282ac'
>
> I'll leave the question of how I generated that as an exercise. (Hint: it
> was a one-liner, involving two method calls and a function call, all
> builtins in Python.)

Hmm. Actually, I'm a bit confused.

>>> hex("100€".encode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer

Nope, that's not it. Needs something to turn the bytes into an integer first.
But I can't find a way to do that. Best I can find is:

>>> "100€".encode().hex()
'313030e282ac'

No "0x" prefix, no function call. So, I'm stuck. How did you create your one?

ChrisA


Re: translating foreign data

2018-06-25 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 17:52:55 -0400, Richard Damon wrote:

> If you have more than just a number representing a value in the locale
> currency, you can't ask the locale how to present/accept it.

You're the only one saying that it has to be handled by the locale.


--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-25 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 17:05:17 -0400, Richard Damon wrote:

> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
>> On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:
>>
>>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>>> Ok. Here's a value for you:
>>>>
>>>> 100€

[...]
> Locale based currency transformations are defined as a number to/from a
> text string.
>
> The number CAN'T say 100 Euros (can you give me what bit pattern you
> would use for such a number).

You're joking, right? You can't possibly be so ignorant as to actually believe
that. You have, right in front of you, a news post or email containing the text
string "100€", and yet you are writing apparently in full seriousness that
it is impossible to get that text string in a file.

Okay, you want a bit-pattern. In hex:

'0x313030e282ac'

I'll leave the question of how I generated that as an exercise. (Hint: it was a
one-liner, involving two method calls and a function call, all builtins in
Python.)

> The currency is encoded in the locale used for the conversion, so if it
> is using en-US, the currency value would ALWAYS be US$ (which the
> general locale format is just $).

I cannot imagine for a second why you think any of this is even a tiny bit
relevant to the question of how one should read a data file containing currency
in Euro.

You seem to have heard about the locale and decided it is the One True Hammer
that all nails must be hammered with.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-25 Thread Richard Damon

From: Richard Damon 

On 6/23/18 5:31 PM, Ben Finney wrote:
> Richard Damon  writes:
>
>> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
>>>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>>>> Richard Damon wrote:
>>>>>> Data presented to the user should normally use his locale
>>>>>> (unless he has specified something different).
>>>>> Ok. Here's a value for you:
>>>>>
>>>>> 100€
>>>>>
>>> […]
>>> The data you were given was 100 Euros. If your system is incapable
>>> of reading that as 100 Euros, and errors out, then at least you know
>>> that it is brain-damaged and useless.
>>>
>>> But if instead it silently changes the data to $100 (US dollars?
>>> Australian dollars? Zimbabwe dollars? the gods only know what a
>>> system that broken might do...) then it is not only broken but
>>> *dangerously* broken.
>>>
>> […]
>>
>> The number CAN'T say 100 Euros (can you give me what bit pattern you
>> would use for such a number).
> That is (I believe) the point being made: The data is *not* a number. It
> is a value that must encapsulate more than only the number 100, but also
> and simultaneously the currency “Euro”.
If you have more than just a number representing a value in the locale
currency, you can't ask the locale how to present/accept it.
>
>> The currency is encoded in the locale used for the conversion, so if it
>> is using en-US, the currency value would ALWAYS be US$ (which the
>> general locale format is just $). As such 100€ is an invalid input to a
>> system getting a Locale based input for a currency if the locale is not
>> one from a country that uses the euro.
> The value is 100 Euro, a quantity of a particular currency and not
> something trivially converted to US$ (for many reasons, including the
> obvious one that we don't know the exact exchange rate to use, and it
> will be different at a different time).
>
> You appear to be arguing that this value must either be arbitrarily
> converted to the user's local currency, something we agree is impossible
> to do given the data, or the value is simply invalid.
>
> So the rule you assert -- “Data presented to the user should normally use
> his locale” -- fails to usefully handle the very normal case of data that
> represents a quantity of some foreign currency. Any system following
> your asserted rule will give either the wrong answer, or an error. We
> had better hope the rule you assert is not in effect.
>
If the user wants to talk in Euro using software that uses locales, then he
should specify a locale that uses Euros.

If you have a field to enter a foreign currency, then you can NOT make that a
LC_MONETARY field, or you need to make that field use a different locale than
the local locale. This isn't the fault of locales, but a misuse of the
system.

This original question came when it was asked what do I see with 100€ in MY
locale LC_MONETARY. Well, MY locale doesn't have a LC_MONETARY that is euros, so
it can't express that. It is a bit like asking how to draw a circle with 4
straight lines or get to the moon in a boat. It is a question with an improper
premise.

--
Richard Damon


Re: translating foreign data

2018-06-25 Thread Ben Finney

From: Ben Finney 

Richard Damon  writes:

> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
> >> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
> >>> Richard Damon wrote:
> >>> > Data presented to the user should normally use his locale
> >>> > (unless he has specified something different).
> >>>
> >>> Ok. Here's a value for you:
> >>>
> >>> 100€
> >>>
> > […]
> > The data you were given was 100 Euros. If your system is incapable
> > of reading that as 100 Euros, and errors out, then at least you know
> > that it is brain-damaged and useless.
> >
> > But if instead it silently changes the data to $100 (US dollars?
> > Australian dollars? Zimbabwe dollars? the gods only know what a
> > system that broken might do...) then it is not only broken but
> > *dangerously* broken.
> >
> […]
>
> The number CAN'T say 100 Euros (can you give me what bit pattern you
> would use for such a number).

That is (I believe) the point being made: The data is *not* a number. It is a
value that must encapsulate more than only the number 100, but also and
simultaneously the currency “Euro”.

> The currency is encoded in the locale used for the conversion, so if it
> is using en-US, the currency value would ALWAYS be US$ (which the
> general locale format is just $). As such 100€ is an invalid input to a
> system getting a Locale based input for a currency if the locale is not
> one from a country that uses the euro.

The value is 100 Euro, a quantity of a particular currency and not something
trivially converted to US$ (for many reasons, including the obvious one that we
don't know the exact exchange rate to use, and it will be different at a
different time).

You appear to be arguing that this value must either be arbitrarily converted
to the user's local currency, something we agree is impossible to do given the
data, or the value is simply invalid.

So the rule you assert -- “Data presented to the user should normally use
his locale” -- fails to usefully handle the very normal case of data that
represents a quantity of some foreign currency. Any system following your
asserted rule will give either the wrong answer, or an error. We had better
hope the rule you assert is not in effect.

--
 \ “DRM doesn't inconvenience [lawbreakers] -- indeed, over time it |
  `\ trains law-abiding users to become [lawbreakers] out of sheer |
_o__)frustration.” --Charles Stross, 2010-05-09 |
Ben Finney


Re: translating foreign data

2018-06-25 Thread Richard Damon

From: Richard Damon 

On 6/23/18 11:27 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:
>
>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>> Ok. Here's a value for you:
>>>
>>> 100€
>>>
>>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
>> If I processed that on my system I would either get $100, or an error of
>> wrong currency symbol depending on the error checking.
> Then your system is so unbelievably broken that it should be nuked from
> orbit, just to be sure.
>
> The data you were given was 100 Euros. If your system is incapable of
> reading that as 100 Euros, and errors out, then at least you know that it
> is brain-damaged and useless.
>
> But if instead it silently changes the data to $100 (US dollars?
> Australian dollars? Zimbabwe dollars? the gods only know what a system
> that broken might do...) then it is not only broken but *dangerously*
> broken.
>
Locale based currency transformations are defined as a number to/from a text
string.

The number CAN'T say 100 Euros (can you give me what bit pattern you would use
for such a number).
The currency is encoded in the locale used for the conversion, so if it is
using en-US, the currency value would ALWAYS be US$ (for which the general
locale format is just $). As such, 100€ is an invalid input to a system getting
a Locale based input for a currency if the locale is not one from a country
that uses the euro. What the input sees is '1', '0', '0', some funny character
(or maybe 2 of them). A poorly designed input, or one being intentionally
generous on input acceptance, would return 100, which would be implied US
Dollars. A better error checking routine would give an error. It is IMPOSSIBLE
for it to return a number that would be 100 euros. I suppose a very smart
system might see that it was in a different currency and try to convert it, but
unless it knows which time reference point to use for the conversion, you are
likely to get a wrong answer, and in any case, the answer will NOT be 100 euros,
but some equivalent value in Dollars.

Now, if you want to define a perhaps more general currency input routine that
tries to detect a pan-locale currency input, and returns both a value and a
currency type, that could be more useful in some contexts. But you then run
into the interesting (and difficult) problem that if you see the input
123.456€, what is that value? Is it a value around a hundred euros specified
to 3 decimal places, or is it a number just over 100 thousand euros?
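
To make that ambiguity concrete, here is a minimal Python sketch. The function
name parse_amount, its parameters and the small set of currency symbols are
all made up for illustration; it is not a real library API, just one way such
a routine could look if the caller states the separators explicitly.

```python
from decimal import Decimal

def parse_amount(text, decimal_sep, group_sep):
    """Return (value, currency symbol) for text such as '123.456€'.

    The separators are supplied by the caller, because the text alone
    cannot tell us which convention was used to write it.
    """
    # Hypothetical, deliberately tiny set of trailing currency symbols.
    digits = text.rstrip("€$£¥ ")
    currency = text[len(digits):].strip()
    # Drop the grouping separator, then normalise the decimal separator.
    plain = digits.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(plain), currency

# Same characters, two different values, depending on the stated rules.
as_us = parse_amount("123.456€", decimal_sep=".", group_sep=",")
as_de = parse_amount("123.456€", decimal_sep=",", group_sep=".")
print(as_us)
print(as_de)
```

The first call reads the text as roughly a hundred euros, the second as just
over a hundred thousand; nothing in the input itself says which one was meant.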
>
> [...]
>> Locale predates UCS-2, it was the early attempt to provide
>> internationalization to C code so even programmers who didn't think
>> about it could add the line setlocale(LC_ALL, "") and make their code
>> work at least mostly right in more places. A single global was quick and
>> simple, and since threads didn't exist, not an issue.
> Threads were first used in 1967, five years before C even existed.
>
> https://en.wikipedia.org/wiki/Thread_%28computing%29#History
>
Threads did NOT exist (at least to the Standard) in C when locales were added,
and the C language did nothing to support threading at that time. Looking back,
it was perhaps a regrettable decision to implement locales globally the way
they were, but it is what it is.

--
Richard Damon


Re: translating foreign data

2018-06-24 Thread Stefan Ram

  To: Richard Damon
From: r...@zedat.fu-berlin.de (Stefan Ram)

Richard Damon  writes:
>Now, if I have a parser that doesn't use the locale, but some other rule
>base than I just need to provide it with the right rules, which is
>basically just defining the right locale.

  Here's an example C++ program I wrote. It uses the class s
  to provide rules for an ad hoc locale which then is used to
  imbue a temporary string stream which then can parse numbers
  using the thousands separator given by s.

  main.cpp

#include <iostream>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

using namespace ::std::literals;

struct s : ::std::numpunct< char >
{ char do_thousands_sep() const override { return ','; }
  ::std::string do_grouping() const override { return "\3"; }};

static double double_value_of( ::std::string const & string ) {
::std::stringstream source { string };
  source.imbue( ::std::locale( source.getloc(), new s ));
  double number; source >> number; return number; }

int main()
{ ::std::cout << double_value_of( "4,800.1"s )<< '\n';
  ::std::cout << double_value_of( "3,334.5e9"s )<< '\n'; }

  transcript

4800.1
3.3345e+012


Re: translating foreign data

2018-06-24 Thread Peter J. Holzer

From: "Peter J. Holzer" 


On 2018-06-23 12:11:34 -0400, Richard Damon wrote:
> On 6/23/18 10:05 AM, Peter J. Holzer wrote:
> > On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
> >> Once you open the Locale can of worms, EVERYTHING has a locale, to say
> >> you aren't using a locale is to say you are writing
> >> something unintelligible, as you can thing of the locale as the set of
> >> rules to interpret
> > I don't think that's a useful way to look at it. "Locale" in
> > (non-technical) English means "place" or "site". The idea behind the
> > locale concept is that some conventions (e.g. how to write numbers or
> > how to write strings) depend on the place where the program runs (or
> > maybe where the user is sitting or grew up or maybe where a file was
> > produced).
> >
> > For stuff which doesn't depend on the place (e.g. how a Python program
> > should be parsed), the locale concept doesn't apply.
> >
> The Locale should NOT be the place the computer is running in (at least
> not anymore), but where the data and the user are from (which can be
> different).

Yes, it can be different, but for some *very* common cases (PCs, smartphones
most of the time) it isn't. More importantly for the concept, when the concept
was developed (in the late 1980's) it was very common (probably more common
than 10 years earlier).

> Do your really mean that when I travel to a place that uses
> . as the thousands separator and , as the decimal separator (instead of
> my normal environment when they are the other way around) all my
> programs should immediately change how they read all my data files and
> how I need to enter data? I hope not.

Sometimes, yes. If you want to work with your colleagues at that place they
might thank you to use the local conventions.

> I want my computer to use the Locale of where "I" came from (not
> current am) to talk to me,

That's why I wrote "or grew up".

> and to be able to set the Locale to interpret data to match the rules
> the person who generated them used to generate them,

And that's why I wrote "where a file was produced".

So many words to repeat what I already wrote ...


> so if they swap . and , compared to me, I can tell the program that.
> Your last parenthetical comment in the first paragraph is my key
> point,

I think it is the weakest point. The locale is useful for interactive use
(input and output) and also for output intended for human users. For parsing
files it is woefully inadequate (also for generating files intended to be
parsed).

> the locale used to read data should match the locale used to generate
> it, and that can easily be different than the locale being used to
> interact with the user.

Which is basically why "locale" is a rather useless concept with files. When I
get a CSV file, I don't want to say "use locale en_US.cp437", because the
location "US" is almost completely irrelevant, the language "English" is
somewhat relevant but much too specific, and the list separator isn't there at
all. I want to tell it: Decode using CP437, a decimal point, tabs as a list
separator, CRLF as the record separator, no quoting.
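
In Python, spelling out those rules directly might look roughly like the
sketch below. The sample bytes and the helper name read_table are invented
for illustration; the point is only that every rule is stated explicitly and
no locale is consulted.

```python
import csv
import io

# Made-up sample: CP437 bytes, tab-separated fields, CRLF record ends.
raw = "name\tprice\r\ncafé\t4800.1\r\ngadget\t12.5\r\n".encode("cp437")

def read_table(data):
    """Decode and split the table using explicitly stated rules."""
    # newline="" leaves the CRLF endings for the csv module to handle.
    text = io.StringIO(data.decode("cp437"), newline="")
    # Tabs separate fields and quote characters get no special treatment.
    reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]

rows = read_table(raw)
# float() always expects a decimal point, whatever the process locale is.
prices = [float(row["price"]) for row in rows]
print(rows)
print(prices)
```

Note that csv.reader recognises the record ends on its own, so the CRLF rule
only really matters on the writing side.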

> If a program doesn't care about the locale it is running in, like a
> Python compiler, the either it needs to use routines that totally ignore
> the locale or it needs to set the locale to one that matches the rules
> it wants.

The former. Because locales are in general opaque, so you can never be sure
that a given locale will use the rules you want ("C" is the exception, but not
very useful).

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 11:44 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 08:12:52 -0400, Richard Damon wrote:
>
>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>>
>>>> If you know the Locale, then you do know what the decimal separator
>>>> is, as that is part of what a locale defines.
>>> A locale defines a set of common cultural conventions. It doesn't
>>> mandate the actual conventions in use in any specific document.
>>>
>>> If I'm in Australia, using the en-AU locale, nevertheless I can
>>> generate a file using , as a decimal separator. Try and stop me :-)
>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>> One, just as you can misuse the language and write cat when you mean a
>> member of the Canine group,
> How about if I write "le chien" or "der Hund" or "собака"? Is that also a
> misuse of the locale because I choose to write in a foreign language,
> using foreign conventions for spelling, grammar and syntax?
>
>
>> but then the misinterpretation is on the
>> creator of the document, not on the program that was told how the
>> document is to be read.
> You're assuming that there will be a misinterpretation. That's an absurd
> assumption to make. There might be, of course, but the documentation for
> my document might be clear that comma is to be used for decimal
> separators. Or it might include numbers like
>
> 1.234.567,012345678
>
> which is understandable to anyone who is aware of the possibility that
> comma may mean decimal separator and period the thousands separator.
>
Then I shouldn't be using en-AU to decode the file. If I use a locale based
parser, I need to give it the right locale.

Now, if I have a parser that doesn't use the locale, but some other rule base,
then I just need to provide it with the right rules, which is basically just
defining the right locale.

--
Richard Damon


Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 10:05 AM, Peter J. Holzer wrote:
> On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
>> On 6/23/18 8:28 AM, Peter J. Holzer wrote:
>>> On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
>>>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>>>> If I'm in Australia, using the en-AU locale, nevertheless I can generate
>>>>> a file using , as a decimal separator. Try and stop me :-)
>>>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>>>> One, just as you can misuse the language and write cat when you mean a
>>>> member of the Canine group, but then the misinterpretation is on the
>>>> creator of the document, not on the program that was told how the
>>>> document is to be read.
>>> How would he mis-use the en-AU locale to write 1 as "1,000"? I think
>>> to do that he would simply NOT use the locale.
>> Once you open the Locale can of worms, EVERYTHING has a locale, to say
>> you aren't using a locale is to say you are writing
>> something unintelligible, as you can thing of the locale as the set of
>> rules to interpret
> I don't think that's a useful way to look at it. "Locale" in
> (non-technical) English means "place" or "site". The idea behind the
> locale concept is that some conventions (e.g. how to write numbers or
> how to write strings) depend on the place where the program runs (or
> maybe where the user is sitting or grew up or maybe where a file was
> produced).
>
> For stuff which doesn't depend on the place (e.g. how a Python program
> should be parsed), the locale concept doesn't apply.
>
The Locale should NOT be the place the computer is running in (at least not
anymore), but where the data and the user are from (which can be different). Do
you really mean that when I travel to a place that uses . as the thousands
separator and , as the decimal separator (instead of my normal environment,
where they are the other way around) all my programs should immediately change
how they read all my data files and how I need to enter data? I hope not. I want
my computer to use the Locale of where "I" came from (not where I currently am)
to talk to me, and to be able to set the Locale used to interpret data to match
the rules the person who generated the data used, so if they swap . and ,
compared to me, I can tell the program that. Your last parenthetical comment in
the first paragraph is my key point: the locale used to read data should match
the locale used to generate it, and that can easily be different from the
locale being used to interact with the user.

If a program doesn't care about the locale it is running in, like a Python
compiler, then either it needs to use routines that totally ignore the locale or
it needs to set the locale to one that matches the rules it wants.

--
Richard Damon


Re: translating foreign data

2018-06-24 Thread Peter J. Holzer

From: "Peter J. Holzer" 


On 2018-06-23 12:41:33 -0400, Richard Damon wrote:
> On 6/23/18 11:44 AM, Steven D'Aprano wrote:
> > You're assuming that there will be a misinterpretation. That's an absurd
> > assumption to make. There might be, of course, but the documentation for
> > my document might be clear that comma is to be used for decimal
> > separators. Or it might include numbers like
> >
> > 1.234.567,012345678
> >
> > which is understandable to anyone who is aware of the possibility that
> > comma may mean decimal separator and period the thousands separator.
> >
> Then I shouldn't be using en-AU to decode the file.

Quite right, You shouldn't.

> Now, if I have a parser that doesn't use the locale, but some other rule
> base than I just need to provide it with the right rules, which is
> basically just defining the right locale.

Nope. The right rules for almost any file format are much more than the locale.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-24 Thread Dennis Lee Bieber

From: Dennis Lee Bieber 

On Sat, 23 Jun 2018 15:44:14 + (UTC), Steven D'Aprano
 declaimed the following:

>1.234.567,012345678
>
>which is understandable to anyone who is aware of the possibility that
>comma may mean decimal separator and period the thousands separator.
>
Or it is an oddly marked local (7-digit) telephone number (though
typically for US, a leading "1" implies long-distance and should be followed by
a 10-digit phone number) followed by a 9-digit (US) postal zip-code.



--
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.comHTTP://wlfraed.home.netcom.com/


Re: translating foreign data

2018-06-24 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 08:12:52 -0400, Richard Damon wrote:

> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>
>>> If you know the Locale, then you do know what the decimal separator
>>> is, as that is part of what a locale defines.
>> A locale defines a set of common cultural conventions. It doesn't
>> mandate the actual conventions in use in any specific document.
>>
>> If I'm in Australia, using the en-AU locale, nevertheless I can
>> generate a file using , as a decimal separator. Try and stop me :-)
>
> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> One, just as you can misuse the language and write cat when you mean a
> member of the Canine group,

How about if I write "le chien" or "der Hund" or "собака"? Is that also a
misuse of the locale because I choose to write in a foreign language, using
foreign conventions for spelling, grammar and syntax?

> but then the misinterpretation is on the
> creator of the document, not on the program that was told how the
> document is to be read.

You're assuming that there will be a misinterpretation. That's an absurd
assumption to make. There might be, of course, but the documentation for my
document might be clear that comma is to be used for decimal separators. Or it
might include numbers like

1.234.567,012345678

which is understandable to anyone who is aware of the possibility that comma
may mean decimal separator and period the thousands separator.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-24 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:

> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:

>> Ok. Here's a value for you:
>>
>> 100€
>>
>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
>
> If I processed that on my system I would either get $100, or an error of
> wrong currency symbol depending on the error checking.

Then your system is so unbelievably broken that it should be nuked from orbit,
just to be sure.

The data you were given was 100 Euros. If your system is incapable of reading
that as 100 Euros, and errors out, then at least you know that it is
brain-damaged and useless.

But if instead it silently changes the data to $100 (US dollars? Australian
dollars? Zimbabwe dollars? the gods only know what a system that broken might
do...) then it is not only broken but *dangerously* broken.

[...]
> Locale predates UCS-2, it was the early attempt to provide
> internationalization to C code so even programmers who didn't think
> about it could add the line setlocale(LC_ALL, "") and make their code
> work at least mostly right in more places. A single global was quick and
> simple, and since threads didn't exist, not an issue.

Threads were first used in 1967, five years before C even existed.

https://en.wikipedia.org/wiki/Thread_%28computing%29#History

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson


Re: translating foreign data

2018-06-24 Thread Peter J. Holzer

From: "Peter J. Holzer" 


On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
> On 6/23/18 8:28 AM, Peter J. Holzer wrote:
> > On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
> >> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> >>> If I'm in Australia, using the en-AU locale, nevertheless I can generate
> >>> a file using , as a decimal separator. Try and stop me :-)
> >> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> >> One, just as you can misuse the language and write cat when you mean a
> >> member of the Canine group, but then the misinterpretation is on the
> >> creator of the document, not on the program that was told how the
> >> document is to be read.
> > How would he mis-use the en-AU locale to write 1 as "1,000"? I think
> > to do that he would simply NOT use the locale.
> Once you open the Locale can of worms, EVERYTHING has a locale, to say
> you aren't using a locale is to say you are writing
> something unintelligible, as you can thing of the locale as the set of
> rules to interpret

I don't think that's a useful way to look at it. "Locale" in (non-technical)
English means "place" or "site". The idea behind the locale concept is that
some conventions (e.g. how to write numbers or how to write strings) depend on
the place where the program runs (or maybe where the user is sitting or grew up
 or maybe where a file was produced).

For stuff which doesn't depend on the place (e.g. how a Python program should
be parsed), the locale concept doesn't apply.


> > You two also seem to be writing about different things when you write
> > "THE locale". Steven seems to mean the global settings a user has
> > chosen, you seem to mean the specidic settings appropriate for parsing a
> > specific file.

While I was writing this paragraph I realized that I had also used "the locale"
in a specific meaning in the previous paragraph. I decided to let it stand and
see whether anyone would call me out on it.

> You have THE locale for a given piece of data.

Well, you didn't. Even though I quite obviously used "the locale" in Steven's
meaning, you didn't react to that at all and just continue as if your
definition is the only possible one.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-24 Thread Peter J. Holzer

From: "Peter J. Holzer" 



On 2018-06-23 16:05:49 +0200, Peter J. Holzer wrote:
> I don't think that's a useful way to look at it. "Locale" in
> (non-technical) English means "place" or "site". The idea behind the
> locale concept is that some conventions (e.g. how to write numbers or
> how to write strings) depend on the place where the program runs

Sorry, I meant "how to *sort* strings".

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-24 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :

> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>> Richard Damon :
>>
>>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>>> I always know my locale. The locale is tied to the human user.
>>> No, it should be tied to the data you are processing.
>>In computing, a locale is a set of parameters that defines the user's
>>language, region and any special variant preferences that the user
>>wants to see in their user interface.
>>
>><https://en.wikipedia.org/wiki/Locale_(computer_software)>
>>
>> The data should not depend on the locale.
> So no one foreign ever gives you data?

Never in my decades in computer programming have I found any use for locales.

In particular, they have never helped me decode "foreign" data, whether in
ASCII, Latin-1, Latin-3, Latin-9, JIS or UTF-8.

> Note, that wikipedia article is focused on the SYSTEM locale, which
> yes, that should reflect the what the user wants in his interface.

I don't think locales have anything to do with anything else.


>>> If an English user is feeding a program Chinese documents, while
>>> processing those documents the program should be using the
>>> appropriate Chinese Locale.
>> Not true.
> How else is the program going to understand the Chinese data?

If someone gives me a file, they had better indicate the file format.

> The fact that locale issues leak into data is the reason that the
> single immutable global locale doesn't work.

Locales don't work. Period.

> You really want to imbue into data streams what locale their data
> represents (and use that in some of the later processing of data from
> that stream).

Can you refer to a standard for that kind of imbuement?

Of course, you have document types, schema definitions and other implicit and
explicit format indicators. You shouldn't call them locales, though.

>>> Data presented to the user should normally use his locale (unless he
>>> has specified something different).
>> Ok. Here's a value for you:
>>
>> 100€
>>
>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
> If I processed that on my system I would either get $100, or an error of
> wrong currency symbol depending on the error checking.

Don't forget to convert the amount as well...

>> The single global is due to what the locale was introduced for. It
>> came about around the time when Unix applications were being made
>> "8-bit clean." Along with UCS-2 and XML, it's one of those things you
>> wish you'd never have to deal with.
>
> Locale predates UCS-2, it was the early attempt to provide
> internationalization to C code so even programmers who didn't think
> about it could add the line setlocale(LC_ALL, "") and make their code
> work at least mostly right in more places. A single global was quick
> and simple, and since threads didn't exist, not an issue.
>
> In many ways it was the first attempt that should have been thrown
> away, but got too intertwined. C++ made a significant improvement to
> it by having streams remember their own locale.

No one should breathe any new life into locales.

And yes, add C++ to the list of things you wish you'd never have to deal
with...


Marko


Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
> Richard Damon :
>
>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>> I always know my locale. The locale is tied to the human user.
>> No, it should be tied to the data you are processing.
>In computing, a locale is a set of parameters that defines the user's
>language, region and any special variant preferences that the user
>wants to see in their user interface.
>
><https://en.wikipedia.org/wiki/Locale_(computer_software)>
>
> The data should not depend on the locale.
So no one foreign ever gives you data? Note that the Wikipedia article is
focused on the SYSTEM locale, which, yes, should reflect what the user wants in
his interface.
>
>> If an English user is feeding a program Chinese documents, while
>> processing those documents the program should be using the appropriate
>> Chinese Locale.
> Not true.
How else is the program going to understand the Chinese data?
>
>> Again, no, a locale is tied to the data, not the user (unless you want
>> to require the user to translate all data to his locale conventions
>> (without using a program that can use locale information) before
>> providing it to a program. Yes, the default for the interpretation
>> should be the users default/current locale, but you really want them
>> to be able to say I got this file from someone whose locale was
>> different than mine.
> The locale is not directly related to data or data formats. Of course,
> locales leak into data and create the sorry mess we are talking about.
The fact that locale issues leak into data is the reason that the single
immutable global locale doesn't work. You really want to imbue into data
streams what locale their data represents (and use that in some of the later
processing of data from that stream).
>
>> Data presented to the user should normally use his locale (unless he
>> has specified something different).
> Ok. Here's a value for you:
>
> 100€
>
> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
If I processed that on my system I would either get $100, or an error of wrong
currency symbol depending on the error checking.
>
>>> BTW, I think the locale is a terrible invention.
>> The locale is a lot better than the alternative, where every
>> application that needs to deal with internationalization need to
>> recreate (and debub) all of the mechanism. I agree it isn't perfect,
>> and for small simple programs it would be nice to be able to say "I
>> don't want all this stuff, make it go away".
> The locale doesn't solve a single problem in practice and often trips up
> programs. For example, a customer-visible bug was once caused by:
>
>sort 
> producing different results on different customers' machines.
>
> Mental note: *always* prefix GNU textutils commands with LANG=C.
Yes, one issue is that systems currently don't naturally tag data with the
locale to use (you can't even know for sure what character set a file is in, so
your example above might be 100 followed by some funny character(s)). It is
starting to be true that you can often assume UTF-8 (at least on Linux; on
Windows much less so), and data that validates as UTF-8 is a pretty good sign
that it really is UTF-8.
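
A minimal Python sketch of the validation idea above. The helper name
looks_like_utf8 is made up for illustration; the check is simply "does it
decode without error", which is a heuristic and not proof of the encoding
(pure ASCII passes trivially).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings: only the first is valid UTF-8.
euro_utf8 = "100€".encode("utf-8")      # 3 ASCII bytes + 3-byte euro sign
euro_cp1252 = "100€".encode("cp1252")   # 3 ASCII bytes + single byte 0x80

print(looks_like_utf8(euro_utf8))
print(looks_like_utf8(euro_cp1252))
```

A lone 0x80 byte can never start a UTF-8 sequence, which is why the cp1252
version is rejected.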
>
>> Python took its locale (at least initially) from C, which was a single
>> global which does have more issues because of this.
> The single global is due to what the locale was introduced for. It came
> about around the time when Unix applications were being made "8-bit
> clean." Along with UCS-2 and XML, it's one of those things you wish
> you'd never have to deal with.
>
>
> Marko

Locale predates UCS-2. It was the early attempt to provide internationalization
to C code, so that even programmers who didn't think about it could add the line
setlocale(LC_ALL, "") and make their code work at least mostly right in more
places. A single global was quick and simple, and since threads didn't exist at
the time, it was not an issue.

In many ways it was the first attempt that should have been thrown away, but
it got too intertwined. C++ made a significant improvement by having streams
remember their own locale.

--
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
> Richard Damon :
>> If you know the Locale, then you do know what the decimal separator
>> is, as that is part of what a locale defines.
> I don't know what that sentence means.
When you set the locale
>
>> The issue is that if you just know the encoding, you don't necessarily
>> know the locale.
> I always know my locale. The locale is tied to the human user.
No, it should be tied to the data you are processing. If an English user is
feeding a program Chinese documents, while processing those documents the
program should be using the appropriate Chinese Locale. When generating output
to the user, it should switch (back) to the appropriate English Locale (likely
the system locale that the user set).
>
>> He also commented that he didn't want to set the locale in the
>> routine, as that sets it globally for the full application (but
>> perhaps that latter could be fixed by first doing a
>> locale.getlocale(), then setlocale for the files locale, and then at
>> the end of reading and processing restore back the old locale.
> Setting a locale application-wise is
>
>  * not in accordance with the idea of a locale (the locale should be
>constant within a user session)
Again, no, a locale is tied to the data, not the user (unless you want to
require the user to translate all data to his locale conventions, without using
a program that can use locale information, before providing it to a program).
Yes, the default for the interpretation should be the user's default/current
locale, but you really want them to be able to say "I got this file from
someone whose locale was different than mine."

Data presented to the user should normally use his locale (unless he has
specified something different).
>
>  * not easily possible (the locale is seen by all threads
>simultaneously)
That is an implementation error. It should be possible to create a
thread-specific locale, and it is really useful to create a local locale that
can be passed to the various conversion operators to say "for this conversion
use this specific locale", as that is how this data indicated it is to be
interpreted.
>
>
> BTW, I think the locale is a terrible invention.
>
>
> Marko

The locale is a lot better than the alternative, where every application that
needs to deal with internationalization needs to recreate (and debug) all of the
mechanism. I agree it isn't perfect, and for small simple programs it would be
nice to be able to say "I don't want all this stuff, make it go away".

Python took its locale (at least initially) from C, where it is a single global,
which causes more issues. C++ objectified the locale and allows the programmer
to imbue a specific locale into different parts of his program (in particular,
each I/O stream knows what locale its data is to be processed with). Perhaps it
would be good for Python to adopt (maybe it already has) the object-based
locale concept of C++, where streams know their locale and other operations can
optionally be passed a locale to use, though that does come at a significant
cost for things like CPython.
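
To make the "pass the conventions in explicitly" idea concrete, here is a
rough Python sketch. NumberFormat, US_STYLE and GERMAN_STYLE are hypothetical
names invented for this example, not part of the locale module, and the parser
deliberately does no validation of digit grouping.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NumberFormat:
    """Number conventions carried as an object instead of global state."""
    decimal_sep: str = "."
    thousands_sep: str = ""

    def parse(self, text: str) -> Decimal:
        cleaned = text.strip()
        # Drop grouping characters first, then normalise the decimal mark.
        if self.thousands_sep:
            cleaned = cleaned.replace(self.thousands_sep, "")
        if self.decimal_sep != ".":
            cleaned = cleaned.replace(self.decimal_sep, ".")
        return Decimal(cleaned)


# Two sets of conventions that can coexist in one program.
US_STYLE = NumberFormat(decimal_sep=".", thousands_sep=",")
GERMAN_STYLE = NumberFormat(decimal_sep=",", thousands_sep=".")

print(US_STYLE.parse("4,800.1"))
print(GERMAN_STYLE.parse("4.800,1"))
```

Because nothing global is changed, two files with different conventions can
be read side by side, even from different threads.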

--
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>
>> If you know the Locale, then you do know what the decimal separator is,
>> as that is part of what a locale defines.
> A locale defines a set of common cultural conventions. It doesn't mandate
> the actual conventions in use in any specific document.
>
> If I'm in Australia, using the en-AU locale, nevertheless I can generate
> a file using , as a decimal separator. Try and stop me :-)
Yes, you can MISUSE the en-AU locale and write 1,000 to mean the number one,
just as you can misuse the language and write cat when you mean a member of the
canine group, but then the misinterpretation is on the creator of the
document, not on the program that was told how the document is to be read.
>
> But your point is taken -- I misread Ethan saying that he knew the locale
> and it wasn't helping, when in fact he was reluctant to change the locale
> as that's a process-wide global change.
>
>> The issue is that if you just
>> know the encoding, you don't necessarily know the locale. He also
>> commented that he didn't want to set the locale in the routine, as that
>> sets it globally for the full application (but perhaps that latter could
>> be fixed by first doing a locale.getlocale(), then setlocale for the
>> files locale, and then at the end of reading and processing restore back
>> the old locale.
> Indeed.
>
>

--
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Steven D'Aprano

From: Steven D'Aprano 

On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:

> If you know the Locale, then you do know what the decimal separator is,
> as that is part of what a locale defines.

A locale defines a set of common cultural conventions. It doesn't mandate the
actual conventions in use in any specific document.

If I'm in Australia, using the en-AU locale, nevertheless I can generate a file
using , as a decimal separator. Try and stop me :-)

But your point is taken -- I misread Ethan saying that he knew the locale and
it wasn't helping, when in fact he was reluctant to change the locale as that's
a process-wide global change.

> The issue is that if you just
> know the encoding, you don't necessarily know the locale. He also
> commented that he didn't want to set the locale in the routine, as that
> sets it globally for the full application (but perhaps that latter could
> be fixed by first doing a locale.getlocale(), then setlocale for the
> files locale, and then at the end of reading and processing restore back
> the old locale.

Indeed.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :

> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>> I always know my locale. The locale is tied to the human user.
> No, it should be tied to the data you are processing.

   In computing, a locale is a set of parameters that defines the user's
   language, region and any special variant preferences that the user
   wants to see in their user interface.

   <https://en.wikipedia.org/wiki/Locale_(computer_software)>

The data should not depend on the locale.

> If an English user is feeding a program Chinese documents, while
> processing those documents the program should be using the appropriate
> Chinese Locale.

Not true.

> Again, no, a locale is tied to the data, not the user (unless you want
> to require the user to translate all data to his locale conventions
> (without using a program that can use locale information) before
> providing it to a program. Yes, the default for the interpretation
> should be the users default/current locale, but you really want them
> to be able to say I got this file from someone whose locale was
> different than mine.

The locale is not directly related to data or data formats. Of course, locales
leak into data and create the sorry mess we are talking about.

> Data presented to the user should normally use his locale (unless he
> has specified something different).

Ok. Here's a value for you:

100€

I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?

>> BTW, I think the locale is a terrible invention.
>
> The locale is a lot better than the alternative, where every
> application that needs to deal with internationalization needs to
> recreate (and debug) all of the mechanism. I agree it isn't perfect,
> and for small simple programs it would be nice to be able to say "I
> don't want all this stuff, make it go away".

The locale doesn't solve a single problem in practice and often trips up
programs. For example, a customer-visible bug was once caused by:

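   sort

producing different results on different customers' machines.

Mental note: *always* prefix GNU textutils commands with LANG=C.

> Python took its locale (at least initially) from C, which was a single
> global which does have more issues because of this.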

The single global is due to what the locale was introduced for. It came about
around the time when Unix applications were being made "8-bit clean." Along
with UCS-2 and XML, it's one of those things you wish you'd never have to deal
with.

Marko

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/23/18 8:28 AM, Peter J. Holzer wrote:
> On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>>> If you know the Locale, then you do know what the decimal separator is,
>>>> as that is part of what a locale defines.
>>> A locale defines a set of common cultural conventions. It doesn't mandate
>>> the actual conventions in use in any specific document.
>>>
>>> If I'm in Australia, using the en-AU locale, nevertheless I can generate
>>> a file using , as a decimal separator. Try and stop me :-)
>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>> One, just as you can misuse the language and write cat when you mean a
>> member of the Canine group, but then the misinterpretation is on the
>> creator of the document, not on the program that was told how the
>> document is to be read.
> How would he mis-use the en-AU locale to write 1 as "1,000"? I think
> to do that he would simply NOT use the locale.
Once you open the locale can of worms, EVERYTHING has a locale. To say you
aren't using a locale is to say you are writing something unintelligible, as
you can think of the locale as the set of rules to interpret the data.
>
> I think there are very good reasons to ignore the locale for specific
> purposes. For example, a Python interpreter should not use the locale
> when parsing Python, and a program producing Python should also ignore
> the locale.
Python, like many languages, defines the formatting of things, so Python
programs should be interpreted according to the "Python" locale (which may
actually be named "C").
>
> You two also seem to be writing about different things when you write
> "THE locale". Steven seems to mean the global settings a user has
> chosen, you seem to mean the specific settings appropriate for parsing a
> specific file.
>
> hp
>
You have THE locale for a given piece of data. My point is that Python has
adopted the C method of a single global locale for a program, so in the program
there is a 'THE locale', which may actually need to be different when
processing different information, leading to some of the issues.

--
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Peter J. Holzer

From: "Peter J. Holzer" 


On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> > On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
> >> If you know the Locale, then you do know what the decimal separator is,
> >> as that is part of what a locale defines.
> > A locale defines a set of common cultural conventions. It doesn't mandate
> > the actual conventions in use in any specific document.
> >
> > If I'm in Australia, using the en-AU locale, nevertheless I can generate
> > a file using , as a decimal separator. Try and stop me :-)
> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> One, just as you can misuse the language and write cat when you mean a
> member of the Canine group, but then the misinterpretation is on the
> creator of the document, not on the program that was told how the
> document is to be read.

How would he mis-use the en-AU locale to write 1 as "1,000"? I think to do that
he would simply NOT use the locale.

I think there are very good reasons to ignore the locale for specific purposes.
For example, a Python interpreter should not use the locale when parsing
Python, and a program producing Python should also ignore the locale.

You two also seem to be writing about different things when you write "THE
locale". Steven seems to mean the global settings a user has chosen, you seem
to mean the specific settings appropriate for parsing a specific file.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Marko Rauhamaa

  To: Richard Damon
From: Marko Rauhamaa 

Richard Damon :
> If you know the Locale, then you do know what the decimal separator
> is, as that is part of what a locale defines.

I don't know what that sentence means.

> The issue is that if you just know the encoding, you don't necessarily
> know the locale.

I always know my locale. The locale is tied to the human user.

> He also commented that he didn't want to set the locale in the
> routine, as that sets it globally for the full application (but
> perhaps that latter could be fixed by first doing a
> locale.getlocale(), then setlocale for the files locale, and then at
> the end of reading and processing restore back the old locale.

Setting a locale application-wise is

 * not in accordance with the idea of a locale (the locale should be
   constant within a user session)

 * not easily possible (the locale is seen by all threads
   simultaneously)


BTW, I think the locale is a terrible invention.


Marko

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Steven D'Aprano

From: Steven D'Aprano 

On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:

> Steven D'Aprano  writes:
>
>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>
>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>> that are not ASCII digits?
>>>>
>>>> Good question.  I have no idea.
>>>
>>> It's much more of an open question than I thought.
>>
>> Nah, Python already solves that for you:
>
> My understanding was that the OP does not (reliably) know the encoding,
> though that was a guess based on a turn of phrase.

I took it the other way: that Ethan *does* know the encoding, and his problem
is that knowing the encoding and/or locale is not enough to recognise whether
to use a period or comma as the decimal separator.

Which it isn't.

If he doesn't know the encoding, he has bigger problems than just converting
strings into floats. Without knowing the encoding, he cannot even reliably
detect non-ASCII digits at all.

> Another guess is that the OP does not have Unicode data.  The term "code
> page" hints at an 8-bit encoding or at least a pre-Unicode one.

Assuming he is using Python 3, or using Python 2 sensibly, once he has
specified the encoding and read the data from the file, he has Unicode.

Unicode is a superset of (ideally) all code pages. Once you have decoded the
data using the appropriate code page, you have a Unicode string, and Python
doesn't care where it came from.

The point is, once Ethan can get the intended characters out of the file into
Python, it doesn't matter what code page they came from. They're now
full-fledged Unicode characters, and Python's float() and int() functions can
easily deal with non-ASCII digits. So long as he has digits in the first place,
float() and int() will deal with them correctly.
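
A short illustration of that point, as a sketch only. The sample strings and
the cp1252 byte string are made up for the example; the digits are written as
escapes so the snippet does not depend on the reader's font or terminal.

```python
# "123" in Arabic-Indic and Devanagari digits.
arabic_indic = "\u0661\u0662\u0663"
devanagari = "\u0967\u0968\u0969"

n1 = int(arabic_indic)
n2 = int(devanagari)

# float() accepts non-ASCII decimal digits too, but the decimal
# separator still has to be an ASCII full stop.
x = float("\u0661\u0662.\u0665")

# Decoding with the right code page yields ordinary Unicode text.
text = b"100\x80".decode("cp1252")

# What float() does NOT do is guess a decimal comma.
try:
    float("1,5")
    comma_accepted = True
except ValueError:
    comma_accepted = False

print(n1, n2, x, text, comma_accepted)
```

So the digits themselves are the easy part; the separator convention is what
still has to be supplied from somewhere.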

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere."
 -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Richard Damon

From: Richard Damon 

On 6/22/18 11:21 PM, Steven D'Aprano wrote:
> On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:
>
>> Steven D'Aprano  writes:
>>
>>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>>
>>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>>> that are not ASCII digits?
>>>>> Good question.  I have no idea.
>>>> It's much more of an open question than I thought.
>>> Nah, Python already solves that for you:
>> My understanding was that the OP does not (reliably) know the encoding,
>> though that was a guess based on a turn of phrase.
> I took it the other way: that Ethan *does* know the encoding, and his
> problem is that knowing the encoding and/or locale is not enough to
> recognise whether to use a period or comma as the decimal separator.
>
> Which it isn't.
If you know the locale, then you do know what the decimal separator is, as that
is part of what a locale defines. The issue is that if you just know the
encoding, you don't necessarily know the locale. He also commented that he
didn't want to set the locale in the routine, as that sets it globally for the
full application (but perhaps the latter could be fixed by first doing a
locale.getlocale(), then a setlocale for the file's locale, and then at the end
of reading and processing restoring the old locale).
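
The save/set/restore idea can be sketched in Python roughly as follows. The
name temporary_locale is invented for this example, and it queries the
current setting with locale.setlocale(category) so that the exact string can
be handed back on restore. Note that this is still a process-wide change
while the block runs, so it is not safe if other threads are formatting or
parsing numbers at the same time.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name: str, category: int = locale.LC_NUMERIC):
    """Switch one locale category for the duration of a with-block."""
    saved = locale.setlocale(category)      # no second argument: query only
    try:
        locale.setlocale(category, name)    # raises locale.Error if unknown
        yield
    finally:
        locale.setlocale(category, saved)   # always put the old value back


before = locale.setlocale(locale.LC_NUMERIC)

# "C" is used here only because it exists on every system; a real caller
# would pass whatever locale name matches the file being read.
with temporary_locale("C"):
    value = locale.atof("4800.1")

after = locale.setlocale(locale.LC_NUMERIC)
print(value, before == after)
```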
>
> If he doesn't know the encoding, he has bigger problems than just
> converting strings into floats. Without knowing the encoding, he cannot
> even reliably detect non-ASCII digits at all.
>
>
>> Another guess is that the OP does not have Unicode data.  The term "code
>> page" hints at an 8-bit encoding or at least a pre-Unicode one.
> Assuming he is using Python 3, or using Python 2 sensibly, once he has
> specified the encoding and read the data from the file, he has Unicode.
>
> Unicode is a superset of (ideally) all code pages. Once you have decoded
> the data using the appropriate code page, you have a Unicode string, and
> Python doesn't care where it came from.
>
> The point is, once Ethan can get the intended characters out of the file
> into Python, it doesn't matter what code page they came from. They're now
> full-fledged Unicode characters, and Python's float() and int() functions
> can easily deal with non-ASCII digits. So long as he has digits in the
> first place, float() and int() will deal with them correctly.
>
>

--
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Ben Bacarisse

  To: Steven D'Aprano
From: Ben Bacarisse 

Steven D'Aprano  writes:

> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>
>>>> The code page remark is curious.  Will some "code pages" have digits
>>>> that are not ASCII digits?
>>>
>>> Good question.  I have no idea.
>>
>> It's much more of an open question than I thought.
>
> Nah, Python already solves that for you:

My understanding was that the OP does not (reliably) know the encoding, though
that was a guess based on a turn of phrase.

Another guess is that the OP does not have Unicode data.  The term "code page"
hints at an 8-bit encoding or at least a pre-Unicode one.

--
Ben.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-24 Thread Richard Damon

On 6/23/18 10:44 PM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 17:52:55 -0400, Richard Damon wrote:
>
>> If you have more than just a number representing a value in the locale
>> currency, you can't ask the locale how to present/accept it.
> You're the only one saying that it has to be handled by the locale.
>
>
Actually, it was part of the problem statement by Marko, since he said
to use LC_MONETARY, which is the part of the locale machinery dealing
with monetary quantities (and can ONLY handle the currency defined by
the locale). What would you think of someone answering a problem
statement that said to write a Python program with a program in, say, Java?

I suppose he could have just meant use the number, which would be like
asking to interpret the value of 100 euros using math.pi.

Or it could have been just a bad question, like "how heavy is blue?" (since
by definition a locale only knows how to handle a single type of
currency, assuming any value is of that type).

My answer was in part to point out the problem with the problem
statement (and people seem to want to jump on me for pointing out the
strengths and weaknesses of the locale system).

This also goes back to the very original question at the beginning of
the thread: the OP had a bunch of data with numbers using varying locale
conventions (he didn't use those words), with various decimal
separators, and some people asked about non-'arabic' numbers (digits
other than 0-9).

This also goes back to some of the comments about file formats. Most
file formats are designed to be 'Machine Read' (even if they use text
formatting) and as such do NOT use localization facilities, so when
processing them you want the I/O processing system to be in a
non-localized mode (typically numbers always use . as the decimal
separator, and usually nothing as the thousands separator). While the
text format files might be opened in a text editor, the file format
doesn't cater to making things pretty for the user. Some programs will
create input/output/storage files where it is expected that the user
WILL open them, look at them and maybe even edit them. Numbers will use
the locale convention of currency and decimal/thousands separators. If
you have such a system, changing the locale rules for these files may
cause misinterpreting the values.

If you are bringing such files from a 'foreign' system, you need to be
able to indicate what locale to use when reading that file. This sounds
very much like the category of problem that the OP was dealing with.
They apparently have a large number of files, presumably organized in some
consistent manner so that the values in them make sense, but the numbers
are written in different locale conventions, and this was causing the
simplistic processing to fail.

-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Chris Angelico

On Sun, Jun 24, 2018 at 1:23 PM, Steven D'Aprano
 wrote:
> On Sun, 24 Jun 2018 12:53:49 +1000, Chris Angelico wrote:
>
> [...]
>>> Okay, you want a bit-pattern. In hex:
>>>
>>> '0x313030e282ac'
> [...]
>
>> Hmm. Actually, I'm a bit confused.
>>
>> >>> hex("100€".encode())
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: 'bytes' object cannot be interpreted as an integer
>>
>> Nope, that's not it. Needs something to turn the bytes into an integer
>> first. But I can't find a way to do that. Best I can find is:
>>
>> >>> "100€".encode().hex()
>> '313030e282ac'
>
> Dammit, that was what I was looking for, but I only looked on *strings*,
> not bytes.
>
>
>> No "0x" prefix, no function call. So, I'm stuck. How did you create your
>> one?
>
> py> hex(int.from_bytes("100€".encode("utf-8"), 'big'))
> '0x313030e282ac'

Ahhh thanks, that's the part I couldn't find (and didn't remember).

Anyhow, encoding to UTF-8 and then to bytes is pretty easy.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sun, 24 Jun 2018 12:53:49 +1000, Chris Angelico wrote:

[...]
>> Okay, you want a bit-pattern. In hex:
>>
>> '0x313030e282ac'
[...]

> Hmm. Actually, I'm a bit confused.
> 
> >>> hex("100€".encode())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: 'bytes' object cannot be interpreted as an integer
> 
> Nope, that's not it. Needs something to turn the bytes into an integer
> first. But I can't find a way to do that. Best I can find is:
> 
> >>> "100€".encode().hex()
> '313030e282ac'

Dammit, that was what I was looking for, but I only looked on *strings*, 
not bytes.

 
> No "0x" prefix, no function call. So, I'm stuck. How did you create your
> one?

py> hex(int.from_bytes("100€".encode("utf-8"), 'big'))
'0x313030e282ac'
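
For completeness, a small sketch that puts both spellings side by side and
also goes back from the hex digits to the text. The variable names are only
for illustration.

```python
encoded = "100€".encode("utf-8")

# bytes.hex() gives the bare hex digits, one pair per byte.
as_hex = encoded.hex()

# Going through an integer gives the "0x" form. Leading zero bytes
# would be lost this way, which does not matter for this string.
as_int_hex = hex(int.from_bytes(encoded, "big"))

# The reverse trip: hex digits back to bytes, then back to text.
decoded = bytes.fromhex(as_hex).decode("utf-8")

print(as_hex, as_int_hex, decoded)
```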



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Chris Angelico

On Sun, Jun 24, 2018 at 12:44 PM, Steven D'Aprano
 wrote:
> You're joking, right? You can't possibly be so ignorant as to actually
> believe that. You have, right in front of you, a news post or email
> containing the text string "100€", and yet you are writing apparently in
> full seriousness that it is impossible to get that text string in a file.
>
> Okay, you want a bit-pattern. In hex:
>
> '0x313030e282ac'
>
> I'll leave the question of how I generated that as an exercise. (Hint: it
> was a one-liner, involving two method calls and a function call, all
> builtins in Python.)

Hmm. Actually, I'm a bit confused.

>>> hex("100€".encode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer

Nope, that's not it. Needs something to turn the bytes into an integer
first. But I can't find a way to do that. Best I can find is:

>>> "100€".encode().hex()
'313030e282ac'

No "0x" prefix, no function call. So, I'm stuck. How did you create your one?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sat, 23 Jun 2018 17:52:55 -0400, Richard Damon wrote:

> If you have more than just a number representing a value in the locale
> currency, you can't ask the locale how to present/accept it.

You're the only one saying that it has to be handled by the locale.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sat, 23 Jun 2018 17:05:17 -0400, Richard Damon wrote:

> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
>> On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:
>>
>>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>>> Ok. Here's a value for you:
>>>>
>>>> 100€

[...]
> Locale based currency transformations are defined as a number to/from a
> text string.
> 
> The number CAN'T say 100 Euros (can you give me what bit pattern you
> would use for such a number).

You're joking, right? You can't possibly be so ignorant as to actually 
believe that. You have, right in front of you, a news post or email 
containing the text string "100€", and yet you are writing apparently in 
full seriousness that it is impossible to get that text string in a file.

Okay, you want a bit-pattern. In hex:

'0x313030e282ac'

I'll leave the question of how I generated that as an exercise. (Hint: it 
was a one-liner, involving two method calls and a function call, all 
builtins in Python.)

> The currency is encoded in the locale used for the conversion, so if it
> is using en-US, the currency value would ALWAYS be US$ (which the
> general locale format is just $).

I cannot imagine for a second why you think any of this is even a tiny 
bit relevant to the question of how one should read a data file 
containing currency in Euro.

You seem to have heard about the locale and decided it is the One True
Hammer that all nails must be hammered with.

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 5:31 PM, Ben Finney wrote:
> Richard Damon  writes:
>
>> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
>>>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>>>> Richard Damon wrote:
>>>>>> Data presented to the user should normally use his locale
>>>>>> (unless he has specified something different).
>>>>> Ok. Here's a value for you:
>>>>>
>>>>> 100€
>>>>>
>>> […]
>>> The data you were given was 100 Euros. If your system is incapable
>>> of reading that as 100 Euros, and errors out, then at least you know
>>> that it is brain-damaged and useless.
>>>
>>> But if instead it silently changes the data to $100 (US dollars?
>>> Australian dollars? Zimbabwe dollars? the gods only know what a
>>> system that broken might do...) then it is not only broken but
>>> *dangerously* broken.
>>>
>> […]
>>
>> The number CAN'T say 100 Euros (can you give me what bit pattern you
>> would use for such a number).
> That is (I believe) the point being made: The data is *not* a number. It
> is a value that must encapsulate more than only the number 100, but also
> and simultaneously the currency “Euro”.
If you have more than just a number representing a value in the locale
currency, you can't ask the locale how to present/accept it.
>
>> The currency is encoded in the locale used for the conversion, so if it
>> is using en-US, the currency value would ALWAYS be US$ (which the
>> general locale format is just $). As such 100€ is an invalid input to a
>> system getting a Locale based input for a currency if the locale is not
>> one from a country that uses the euro.
> The value is 100 Euro, a quantity of a particular currency and not
> something trivially converted to US$ (for many reasons, including the
> obvious one that we don't know the exact exchange rate to use, and it
> will be different at a different time).
>
> You appear to be arguing that this value must either be arbitrarily
> converted to the user's local currency, something we agree is impossible
> to do given the data, or the value is simply invalid.
>
> So the rule you assert – “Data presented to the user should normally use
> his locale” – fails to usefully handle the very normal case of data that
> represents a quantity of some foreign currency. Any system following
> your asserted rule will give either the wrong answer, or an error. We
> had better hope the rule you assert is not in effect.
>
If the user wants to talk in Euro using software that uses locales, then
he should specify a locale that uses Euros.

If you have a field to enter a foreign currency, then you can NOT make
that an LC_MONETARY field, or you need to make that field use a different
locale than the local locale. This isn't the fault of locales, but a
misuse of the system.

The original question was what I see with 100€ in MY locale's
LC_MONETARY. Well, MY locale's LC_MONETARY isn't euros, so it can't
express that. It is a bit like asking how to draw a circle with 4
straight lines or get to the moon in a boat. It is a question with an
improper premise.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Ben Finney

Richard Damon  writes:

> On 6/23/18 11:27 AM, Steven D'Aprano wrote:
> >> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
> >>> Richard Damon wrote:
> >>> > Data presented to the user should normally use his locale
> >>> > (unless he has specified something different).
> >>>
> >>> Ok. Here's a value for you:
> >>>
> >>> 100€
> >>>
> > […]
> > The data you were given was 100 Euros. If your system is incapable
> > of reading that as 100 Euros, and errors out, then at least you know
> > that it is brain-damaged and useless.
> >
> > But if instead it silently changes the data to $100 (US dollars?
> > Australian dollars? Zimbabwe dollars? the gods only know what a
> > system that broken might do...) then it is not only broken but
> > *dangerously* broken.
> >
> […]
>
> The number CAN'T say 100 Euros (can you give me what bit pattern you
> would use for such a number).

That is (I believe) the point being made: The data is *not* a number. It
is a value that must encapsulate more than only the number 100, but also
and simultaneously the currency “Euro”.

> The currency is encoded in the locale used for the conversion, so if it
> is using en-US, the currency value would ALWAYS be US$ (which the
> general locale format is just $). As such 100€ is an invalid input to a
> system getting a Locale based input for a currency if the locale is not
> one from a country that uses the euro.

The value is 100 Euro, a quantity of a particular currency and not
something trivially converted to US$ (for many reasons, including the
obvious one that we don't know the exact exchange rate to use, and it
will be different at a different time).

You appear to be arguing that this value must either be arbitrarily
converted to the user's local currency, something we agree is impossible
to do given the data, or the value is simply invalid.

So the rule you assert – “Data presented to the user should normally use
his locale” – fails to usefully handle the very normal case of data that
represents a quantity of some foreign currency. Any system following
your asserted rule will give either the wrong answer, or an error. We
had better hope the rule you assert is not in effect.

-- 
 \ “DRM doesn't inconvenience [lawbreakers] — indeed, over time it |
  `\ trains law-abiding users to become [lawbreakers] out of sheer |
_o__)frustration.” —Charles Stross, 2010-05-09 |
Ben Finney


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 11:27 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:
>
>> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>>> Ok. Here's a value for you:
>>>
>>> 100€
>>>
>>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
>> If I processed that on my system I would either get $100, or an error of
>> wrong currency symbol depending on the error checking.
> Then your system is so unbelievably broken that it should be nuked from 
> orbit, just to be sure.
>
> The data you were given was 100 Euros. If your system is incapable of 
> reading that as 100 Euros, and errors out, then at least you know that it 
> is brain-damaged and useless.
>
> But if instead it silently changes the data to $100 (US dollars? 
> Australian dollars? Zimbabwe dollars? the gods only know what a system 
> that broken might do...) then it is not only broken but *dangerously* 
> broken.
>
Locale-based currency transformations are defined as a number to/from a
text string.

The number CAN'T say 100 Euros (can you give me what bit pattern you
would use for such a number?).
The currency is encoded in the locale used for the conversion, so if it
is using en-US, the currency value would ALWAYS be US$ (which in the
general locale format is just $). As such, 100€ is an invalid input to a
system getting a locale-based input for a currency if the locale is not
one from a country that uses the euro. What the input sees is '1', '0',
'0', some funny character (or maybe 2 of them). A poorly designed
input, or one being intentionally generous on input acceptance, would
return 100, which would be implied US Dollars. A better error-checking
routine would give an error. It is IMPOSSIBLE for it to return a number
that would be 100 euros. I suppose a very smart system might see that it
was in a different currency and try to convert it, but unless it knows
the time reference point to use for the exchange rate, you are likely to
get a wrong answer. In any case, the answer will NOT be 100 euros, but
some equivalent value in Dollars.

Now, if you want to define a perhaps more general currency input routine
that tries to detect a pan-locale currency input and returns both a
value and a currency type, that could be more useful in some contexts.
But you then run into the interesting (and difficult) problem that if
you see the input 123.456€, what is that value? Is it a value around a
hundred euros specified to 3 decimal places, or is it a number just over
100 thousand euros?
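
A minimal Python sketch of such a routine, not anything from the
standard library: the names parse_money and SYMBOLS are made up for
illustration, and mapping "$" to USD is itself an assumption. The caller
has to state the separator convention, because the text alone cannot
settle it.

```python
from decimal import Decimal

# Hypothetical symbol table for illustration; "$" alone is ambiguous in practice.
SYMBOLS = {"€": "EUR", "$": "USD"}

def parse_money(text, decimal_sep, group_sep):
    """Return (amount, currency_code) using separators chosen by the caller."""
    text = text.strip()
    for symbol, code in SYMBOLS.items():
        if text.startswith(symbol):
            digits = text[len(symbol):]
            break
        if text.endswith(symbol):
            digits = text[:-len(symbol)]
            break
    else:
        raise ValueError(f"no known currency symbol in {text!r}")
    # Drop the grouping separator first, then normalise the decimal separator.
    digits = digits.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(digits), code

# The same text gives two different amounts depending on the stated convention.
print(parse_money("123.456€", decimal_sep=".", group_sep=","))
print(parse_money("123.456€", decimal_sep=",", group_sep="."))
```

The currency travels with the amount in the return value, so nothing is
silently reinterpreted as the local currency.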
>
> [...]
>> Locale predates UCS-2, it was the early attempt to provide
>> internationalization to C code so even programmers who didn't think
>> about it could add the line setlocale(LC_ALL, "") and make their code
>> work at least mostly right in more places. A single global was quick and
>> simple, and since threads didn't exist, not an issue.
> Threads were first used in 1967, five years before C even existed.
>
> https://en.wikipedia.org/wiki/Thread_%28computing%29#History
>
Threads did NOT exist (at least to the Standard) in C when locales were
added, and the C language did nothing to support threading at that time.
Looking back, it was perhaps a regrettable decision to implement locales
globally the way they were, but it is what it is.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Peter J. Holzer

On 2018-06-23 12:41:33 -0400, Richard Damon wrote:
> On 6/23/18 11:44 AM, Steven D'Aprano wrote:
> > You're assuming that there will be a misinterpretation. That's an absurd 
> > assumption to make. There might be, of course, but the documentation for 
> > my document might be clear that comma is to be used for decimal 
> > separators. Or it might include numbers like
> >
> > 1.234.567,012345678
> >
> > which is understandable to anyone who is aware of the possibility that 
> > comma may mean decimal separator and period the thousands separator.
> >
> Then I shouldn't be using en-AU to decode the file.

Quite right, you shouldn't.

> Now, if I have a parser that doesn't use the locale, but some other rule
> base, then I just need to provide it with the right rules, which is
> basically just defining the right locale.

Nope. The right rules for almost any file format are much more than the
locale.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 



Re: translating foreign data

2018-06-23 Thread Peter J. Holzer

On 2018-06-23 12:11:34 -0400, Richard Damon wrote:
> On 6/23/18 10:05 AM, Peter J. Holzer wrote:
> > On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
> >> Once you open the Locale can of worms, EVERYTHING has a locale, to say
> >> you aren't using a locale is to say you are writing
> >> something unintelligible, as you can think of the locale as the set of
> >> rules to interpret
> > I don't think that's a useful way to look at it. "Locale" in
> > (non-technical) English means "place" or "site". The idea behind the
> > locale concept is that some conventions (e.g. how to write numbers or
> > how to write strings) depend on the place where the program runs (or
> > maybe where the user is sitting or grew up or maybe where a file was
> > produced).
> >
> > For stuff which doesn't depend on the place (e.g. how a Python program
> > should be parsed), the locale concept doesn't apply.
> >
> The Locale should NOT be the place the computer is running in (at least
> not anymore), but where the data and the user are from (which can be
> different).

Yes, it can be different, but for some *very* common cases (PCs,
smartphones most of the time) it isn't. More importantly for the concept,
when the concept was developed (in the late 1980s) it was very common
(probably more common than 10 years earlier).

> Do you really mean that when I travel to a place that uses
> . as the thousands separator and , as the decimal separator (instead of
> my normal environment when they are the other way around) all my
> programs should immediately change how they read all my data files and
> how I need to enter data? I hope not.

Sometimes, yes. If you want to work with your colleagues at that place
they might thank you to use the local conventions.

> I want my computer to use the Locale of where "I" came from (not where I
> currently am) to talk to me,

That's why I wrote "or grew up".

> and to be able to set the Locale to interpret data to match the rules
> the person who generated them used to generate them,

And that's why I wrote "where a file was produced".

So many words to repeat what I already wrote ...

> so if they swap . and , compared to me, I can tell the program that.
> Your last parenthetical comment in the first paragraph is my key
> point,

I think it is the weakest point. The locale is useful for interactive
use (input and output) and also for output intended for human users. For
parsing files it is woefully inadequate (also for generating files
intended to be parsed).

> the locale used to read data should match the locale used to generate
> it, and that can easily be different than the locale being used to
> interact with the user.

Which is basically why "locale" is a rather useless concept with files.
When I get a CSV file, I don't want to say "use locale en_US.cp437",
because the location "US" is almost completely irrelevant, the language
"English" is somewhat relevant but much too specific, and the list
separator isn't there at all. I want to tell it: Decode using CP437, a
decimal point, tabs as a list separator, CRLF as the record separator,
no quoting.
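
In Python, that kind of explicit description can be written down
without touching the locale at all. A small sketch using only the
standard csv module; the function name read_table and the sample bytes
are invented for illustration.

```python
import csv
import io

def read_table(raw: bytes):
    """Decode CP437 bytes and split tab-separated fields without quoting."""
    text = raw.decode("cp437")
    # newline="" hands the line endings to the csv module untouched; the
    # reader accepts CRLF (it also accepts bare CR or LF, so CRLF is not
    # enforced here).
    source = io.StringIO(text, newline="")
    reader = csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_table(b'name\tprice\r\n"caf\x82"\t4800.1\r\n')
print(rows)
# float() always expects a decimal point, whatever the process locale is.
print(float(rows[1][1]))
```

With QUOTE_NONE the quote characters stay in the field as ordinary
text, which matches the "no quoting" rule.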

> If a program doesn't care about the locale it is running in, like a
> Python compiler, then either it needs to use routines that totally ignore
> the locale or it needs to set the locale to one that matches the rules
> it wants.

The former. Because locales are in general opaque, so you can never be
sure that a given locale will use the rules you want ("C" is the
exception, but not very useful).

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 11:44 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 08:12:52 -0400, Richard Damon wrote:
>
>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>>
>>>> If you know the Locale, then you do know what the decimal separator
>>>> is, as that is part of what a locale defines.
>>> A locale defines a set of common cultural conventions. It doesn't
>>> mandate the actual conventions in use in any specific document.
>>>
>>> If I'm in Australia, using the en-AU locale, nevertheless I can
>>> generate a file using , as a decimal separator. Try and stop me :-)
>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>> One, just as you can misuse the language and write cat when you mean a
>> member of the Canine group, 
> How about if I write "le chien" or "der Hund" or "собака"? Is that also a 
> misuse of the locale because I choose to write in a foreign language, 
> using foreign conventions for spelling, grammar and syntax?
>
>
>> but then the misinterpretation is on the
>> creator of the document, not on the program that was told how the
>> document is to be read.
> You're assuming that there will be a misinterpretation. That's an absurd 
> assumption to make. There might be, of course, but the documentation for 
> my document might be clear that comma is to be used for decimal 
> separators. Or it might include numbers like
>
> 1.234.567,012345678
>
> which is understandable to anyone who is aware of the possibility that 
> comma may mean decimal separator and period the thousands separator.
>
Then I shouldn't be using en-AU to decode the file. If I use a
locale-based parser, I need to give it the right locale.

Now, if I have a parser that doesn't use the locale but some other rule
base, then I just need to provide it with the right rules, which is
basically just defining the right locale.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 10:05 AM, Peter J. Holzer wrote:
> On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
>> On 6/23/18 8:28 AM, Peter J. Holzer wrote:
>>> On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
>>>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>>>> If I'm in Australia, using the en-AU locale, nevertheless I can generate 
>>>>> a file using , as a decimal separator. Try and stop me :-)
>>>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>>>> One, just as you can misuse the language and write cat when you mean a
>>>> member of the Canine group, but then the misinterpretation is on the
>>>> creator of the document, not on the program that was told how the
>>>> document is to be read.
>>> How would he mis-use the en-AU locale to write 1 as "1,000"? I think
>>> to do that he would simply NOT use the locale.
>> Once you open the Locale can of worms, EVERYTHING has a locale, to say
>> you aren't using a locale is to say you are writing
>> something unintelligible, as you can think of the locale as the set of
>> rules to interpret
> I don't think that's a useful way to look at it. "Locale" in
> (non-technical) English means "place" or "site". The idea behind the
> locale concept is that some conventions (e.g. how to write numbers or
> how to write strings) depend on the place where the program runs (or
> maybe where the user is sitting or grew up or maybe where a file was
> produced).
>
> For stuff which doesn't depend on the place (e.g. how a Python program
> should be parsed), the locale concept doesn't apply.
>
The Locale should NOT be the place the computer is running in (at least
not anymore), but where the data and the user are from (which can be
different). Do you really mean that when I travel to a place that uses
. as the thousands separator and , as the decimal separator (instead of
my normal environment, where they are the other way around) all my
programs should immediately change how they read all my data files and
how I need to enter data? I hope not.

I want my computer to use the Locale of where "I" came from (not where I
currently am) to talk to me, and to be able to set the Locale to
interpret data to match the rules the person who generated them used, so
if they swap . and , compared to me, I can tell the program that. Your
last parenthetical comment in the first paragraph is my key point: the
locale used to read data should match the locale used to generate it,
and that can easily be different than the locale being used to interact
with the user.

If a program doesn't care about the locale it is running in, like a
Python compiler, then either it needs to use routines that totally ignore
the locale or it needs to set the locale to one that matches the rules
it wants.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sat, 23 Jun 2018 08:12:52 -0400, Richard Damon wrote:

> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>
>>> If you know the Locale, then you do know what the decimal separator
>>> is, as that is part of what a locale defines.
>> A locale defines a set of common cultural conventions. It doesn't
>> mandate the actual conventions in use in any specific document.
>>
>> If I'm in Australia, using the en-AU locale, nevertheless I can
>> generate a file using , as a decimal separator. Try and stop me :-)
>
> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> One, just as you can misuse the language and write cat when you mean a
> member of the Canine group, 

How about if I write "le chien" or "der Hund" or "собака"? Is that also a 
misuse of the locale because I choose to write in a foreign language, 
using foreign conventions for spelling, grammar and syntax?

> but then the misinterpretation is on the
> creator of the document, not on the program that was told how the
> document is to be read.

You're assuming that there will be a misinterpretation. That's an absurd 
assumption to make. There might be, of course, but the documentation for 
my document might be clear that comma is to be used for decimal 
separators. Or it might include numbers like

1.234.567,012345678

which is understandable to anyone who is aware of the possibility that 
comma may mean decimal separator and period the thousands separator.
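
When the document's own rules are known, they can be applied directly
instead of through a locale. A rough Python sketch, assuming the
documented convention is "period groups thousands, comma marks the
decimals"; the function name parse_documented_number is invented here.

```python
import re
from decimal import Decimal

# Either digits grouped in threes by "." or plain digits, then an optional
# fraction after ",". This pattern encodes the assumed document convention.
_NUMBER = re.compile(r"[+-]?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")

def parse_documented_number(text):
    """Parse text written with '.' as grouping and ',' as decimal separator."""
    if not _NUMBER.fullmatch(text):
        raise ValueError(f"not in the documented format: {text!r}")
    return Decimal(text.replace(".", "").replace(",", "."))

print(parse_documented_number("1.234.567,012345678"))
# Under this convention "1,000" is the number one, written to three places.
print(parse_documented_number("1,000") == Decimal(1))
```

Input written the other way around, such as "1,234.5", is rejected
rather than guessed at.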

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sat, 23 Jun 2018 09:42:29 -0400, Richard Damon wrote:

> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:

>> Ok. Here's a value for you:
>>
>> 100€
>>
>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
> 
> If I processed that on my system I would either get $100, or an error of
> wrong currency symbol depending on the error checking.

Then your system is so unbelievably broken that it should be nuked from 
orbit, just to be sure.

The data you were given was 100 Euros. If your system is incapable of 
reading that as 100 Euros, and errors out, then at least you know that it 
is brain-damaged and useless.

But if instead it silently changes the data to $100 (US dollars? 
Australian dollars? Zimbabwe dollars? the gods only know what a system 
that broken might do...) then it is not only broken but *dangerously* 
broken.

[...]
> Locale predates UCS-2, it was the early attempt to provide
> internationalization to C code so even programmers who didn't think
> about it could add the line setlocale(LC_ALL, "") and make their code
> work at least mostly right in more places. A single global was quick and
> simple, and since threads didn't exist, not an issue.

Threads were first used in 1967, five years before C even existed.

https://en.wikipedia.org/wiki/Thread_%28computing%29#History

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-23 Thread Peter J. Holzer

On 2018-06-23 16:05:49 +0200, Peter J. Holzer wrote:
> I don't think that's a useful way to look at it. "Locale" in
> (non-technical) English means "place" or "site". The idea behind the
> locale concept is that some conventions (e.g. how to write numbers or
> how to write strings) depend on the place where the program runs

Sorry, I meant "how to *sort* strings".

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 



Re: translating foreign data

2018-06-23 Thread Marko Rauhamaa

Richard Damon :

> On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
>> Richard Damon :
>>
>>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>>> I always know my locale. The locale is tied to the human user.
>>> No, it should be tied to the data you are processing.
>>In computing, a locale is a set of parameters that defines the user's
>>language, region and any special variant preferences that the user
>>wants to see in their user interface.
>>
>><https://en.wikipedia.org/wiki/Locale_(computer_software)>
>>
>> The data should not depend on the locale.
> So no one foreign ever gives you data?

Never in my decades in computer programming have I found any use for
locales.

In particular, they have never helped me decode "foreign" data, whether
in ASCII, Latin-1, Latin-3, Latin-9, JIS or UTF-8.

> Note, that wikipedia article is focused on the SYSTEM locale, which
> yes, that should reflect the what the user wants in his interface.

I don't think locales have anything to do with anything else.


>>> If an English user is feeding a program Chinese documents, while
>>> processing those documents the program should be using the
>>> appropriate Chinese Locale.
>> Not true.
> How else is the program going to understand the Chinese data?

If someone gives me a file, they had better indicate the file format.

> The fact that locale issues leak into data is the reason that the
> single immutable global locale doesn't work.

Locales don't work. Period.

> You really want to imbue into data streams what locale their data
> represents (and use that in some of the later processing of data from
> that stream).

Can you refer to a standard for that kind of imbuement?

Of course, you have document types, schema definitions and other
implicit and explicit format indicators. You shouldn't call them
locales, though.

>>> Data presented to the user should normally use his locale (unless he
>>> has specified something different).
>> Ok. Here's a value for you:
>>
>> 100€
>>
>> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
> If I processed that on my system I would either get $100, or an error of
> wrong currency symbol depending on the error checking.

Don't forget to convert the amount as well...

>> The single global is due to what the locale was introduced for. It
>> came about around the time when Unix applications were being made
>> "8-bit clean." Along with UCS-2 and XML, it's one of those things you
>> wish you'd never have to deal with.
>
> Locale predates UCS-2, it was the early attempt to provide
> internationalization to C code so even programmers who didn't think
> about it could add the line setlocale(LC_ALL, "") and make their code
> work at least mostly right in more places. A single global was quick
> and simple, and since threads didn't exist, not an issue.
>
> In many ways it was the first attempt that should have been thrown
> away, but got too intertwined. C++ made a significant improvement to
> it by having streams remember their own locale.

No one should breathe any new life into locales.

And yes, add C++ to the list of things you wish you'd never have to deal
with...


Marko

Re: translating foreign data

2018-06-23 Thread Peter J. Holzer

On 2018-06-23 08:41:38 -0400, Richard Damon wrote:
> On 6/23/18 8:28 AM, Peter J. Holzer wrote:
> > On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
> >> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> >>> If I'm in Australia, using the en-AU locale, nevertheless I can generate 
> >>> a file using , as a decimal separator. Try and stop me :-)
> >> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> >> One, just as you can misuse the language and write cat when you mean a
> >> member of the Canine group, but then the misinterpretation is on the
> >> creator of the document, not on the program that was told how the
> >> document is to be read.
> > How would he mis-use the en-AU locale to write 1 as "1,000"? I think
> > to do that he would simply NOT use the locale.
> Once you open the Locale can of worms, EVERYTHING has a locale, to say
> you aren't using a locale is to say you are writing
> something unintelligible, as you can think of the locale as the set of
> rules to interpret

I don't think that's a useful way to look at it. "Locale" in
(non-technical) English means "place" or "site". The idea behind the
locale concept is that some conventions (e.g. how to write numbers or
how to write strings) depend on the place where the program runs (or
maybe where the user is sitting or grew up or maybe where a file was
produced).

For stuff which doesn't depend on the place (e.g. how a Python program
should be parsed), the locale concept doesn't apply.

> > You two also seem to be writing about different things when you write
> > "THE locale". Steven seems to mean the global settings a user has
> > chosen, you seem to mean the specidic settings appropriate for parsing a
> > specific file.

While I was writing this paragraph I realized that I had also used "the
locale" in a specific meaning in the previous paragraph. I decided to
let it stand and see whether anyone would call me out on it.

> You have THE locale for a given piece of data.

Well, you didn't. Even though I quite obviously used "the locale" in
Steven's meaning, you didn't react to that at all and just continued as
if your definition is the only possible one.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
> Richard Damon :
>
>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>> I always know my locale. The locale is tied to the human user.
>> No, it should be tied to the data you are processing.
>In computing, a locale is a set of parameters that defines the user's
>language, region and any special variant preferences that the user
>wants to see in their user interface.
>
><https://en.wikipedia.org/wiki/Locale_(computer_software)>
>
> The data should not depend on the locale.
So no one foreign ever gives you data? Note that the Wikipedia article is
focused on the SYSTEM locale, which, yes, should reflect what the user
wants in his interface.
>
>> If an English user is feeding a program Chinese documents, while
>> processing those documents the program should be using the appropriate
>> Chinese Locale.
> Not true.
How else is the program going to understand the Chinese data?
>
>> Again, no, a locale is tied to the data, not the user (unless you want
>> to require the user to translate all data to his locale conventions
>> (without using a program that can use locale information) before
>> providing it to a program. Yes, the default for the interpretation
>> should be the users default/current locale, but you really want them
>> to be able to say I got this file from someone whose locale was
>> different than mine.
> The locale is not directly related to data or data formats. Of course,
> locales leak into data and create the sorry mess we are talking about.
The fact that locale issues leak into data is the reason that the single
immutable global locale doesn't work. You really want to imbue into data
streams what locale their data represents (and use that in some of the
later processing of data from that stream).
>
>> Data presented to the user should normally use his locale (unless he
>> has specified something different).
> Ok. Here's a value for you:
>
> 100€
>
> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
If I processed that on my system I would either get $100, or an error of
wrong currency symbol depending on the error checking.
>
>>> BTW, I think the locale is a terrible invention.
>> The locale is a lot better than the alternative, where every
>> application that needs to deal with internationalization need to
>> recreate (and debub) all of the mechanism. I agree it isn't perfect,
>> and for small simple programs it would be nice to be able to say "I
>> don't want all this stuff, make it go away".
> The locale doesn't solve a single problem in practice and often trips up
> programs. For example, a customer-visible bug was once caused by:
>
>sort 
> producing different results on different customers' machines.
>
> Mental note: *always* prefix GNU textutils commands with LANG=C.
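
The same split shows up inside Python. A small sketch, assuming only
the standard locale module; the helper name sort_both_ways is made up.
Plain sorted() compares code points and ignores the locale, while
locale.strxfrm follows whatever LC_COLLATE is set for the process.

```python
import locale

def sort_both_ways(lines):
    """Sort once by code point and once through the collation locale."""
    # sorted() on str compares code points and never consults the locale.
    by_code_point = sorted(lines)
    # strxfrm follows the current LC_COLLATE, so pin it to "C" here, much
    # like prefixing a shell command with LANG=C. Note that setlocale
    # changes a process-wide setting.
    locale.setlocale(locale.LC_COLLATE, "C")
    by_collation = sorted(lines, key=locale.strxfrm)
    return by_code_point, by_collation

print(sort_both_ways(["b", "A", "a", "B"]))
```

With a different LC_COLLATE the second list may come out in another
order, which is the kind of per-machine difference described above.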
Yes, one issue is that systems currently don't naturally tag data with
the locale to use (you can't even know for sure what character set a file
is in, so your example above might be 100 followed by some funny
character(s)). It is starting to be true that you can often assume UTF-8
(at least on Linux, on Windows it is much less so), and validating that
it is valid UTF-8 is a pretty good sign that it is.
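
That check is a few lines of Python. A minimal sketch; looks_like_utf8
is an invented name, and a pass is only a hint, since pure ASCII data is
also valid UTF-8 and tells you nothing.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The same text in two encodings: only the first is valid UTF-8.
print(looks_like_utf8("100€".encode("utf-8")))
print(looks_like_utf8("100€".encode("iso8859-15")))
```

In Latin-9 the euro sign is the single byte 0xA4, which cannot appear on
its own in UTF-8, so the second call fails the check.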
>
>> Python took its locale (at least initially) from C, which was a single
>> global which does have more issues because of this.
> The single global is due to what the locale was introduced for. It came
> about around the time when Unix applications were being made "8-bit
> clean." Along with UCS-2 and XML, it's one of those things you wish
> you'd never have to deal with.
>
>
> Marko

Locale predates UCS-2, it was the early attempt to provide
internationalization to C code so even programmers who didn't think
about it could add the line setlocale(LC_ALL, "") and make their code
work at least mostly right in more places. A single global was quick and
simple, and since threads didn't exist, not an issue.

In many ways it was the first attempt that should have been thrown away,
but got too intertwined. C++ made a significant improvement to it by
having streams remember their own locale.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Marko Rauhamaa

Richard Damon :

> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>> I always know my locale. The locale is tied to the human user.
> No, it should be tied to the data you are processing.

   In computing, a locale is a set of parameters that defines the user's
   language, region and any special variant preferences that the user
   wants to see in their user interface.

   <https://en.wikipedia.org/wiki/Locale_(computer_software)>

The data should not depend on the locale.

> If an English user is feeding a program Chinese documents, while
> processing those documents the program should be using the appropriate
> Chinese Locale.

Not true.

> Again, no, a locale is tied to the data, not the user (unless you want
> to require the user to translate all data to his locale conventions
> (without using a program that can use locale information) before
> providing it to a program. Yes, the default for the interpretation
> should be the users default/current locale, but you really want them
> to be able to say I got this file from someone whose locale was
> different than mine.

The locale is not directly related to data or data formats. Of course,
locales leak into data and create the sorry mess we are talking about.

> Data presented to the user should normally use his locale (unless he
> has specified something different).

Ok. Here's a value for you:

100€

I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?

>> BTW, I think the locale is a terrible invention.
>
> The locale is a lot better than the alternative, where every
> application that needs to deal with internationalization needs to
> recreate (and debug) all of the mechanism. I agree it isn't perfect,
> and for small simple programs it would be nice to be able to say "I
> don't want all this stuff, make it go away".

The locale doesn't solve a single problem in practice and often trips up
programs. For example, a customer-visible bug was once caused by:

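   sort

producing different results on different customers' machines.

Mental note: *always* prefix GNU textutils commands with LANG=C.

> Python took its locale (at least initially) from C, which was a single
> global which does have more issues because of this.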

The single global is due to what the locale was introduced for. It came
about around the time when Unix applications were being made "8-bit
clean." Along with UCS-2 and XML, it's one of those things you wish
you'd never have to deal with.

Marko

Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 8:28 AM, Peter J. Holzer wrote:
> On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
>> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
>>> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>>>> If you know the Locale, then you do know what the decimal separator is,
>>>> as that is part of what a locale defines.
>>> A locale defines a set of common cultural conventions. It doesn't mandate 
>>> the actual conventions in use in any specific document.
>>>
>>> If I'm in Australia, using the en-AU locale, nevertheless I can generate 
>>> a file using , as a decimal separator. Try and stop me :-)
>> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
>> One, just as you can misuse the language and write cat when you mean a
>> member of the Canine group, but then the misinterpretation is on the
>> creator of the document, not on the program that was told how the
>> document is to be read.
> How would he mis-use the en-AU locale to write 1 as "1,000"? I think
> to do that he would simply NOT use the locale.
Once you open the Locale can of worms, EVERYTHING has a locale. To say
you aren't using a locale is to say you are writing something
unintelligible, as you can think of the locale as the set of rules used
to interpret it.
>
> I think there are very good reasons to ignore the locale for specific
> purposes. For example, a Python interpreter should not use the locale
> when parsing Python, and a program producing Python should also ignore
> the locale.
Python, like many languages, defines the formatting of things, so Python
programs should be interpreted according to the "Python" locale (which
may actually be named "C").
>
> You two also seem to be writing about different things when you write
> "THE locale". Steven seems to mean the global settings a user has
> chosen, you seem to mean the specific settings appropriate for parsing a
> specific file.
>
> hp
>
You have THE locale for a given piece of data. My point is that Python
has adopted the C method of a single global locale for a program, so in
the program there is a 'THE Locale' which may actually need to be
different when processing different information, leading to some of the
issues.

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
> Richard Damon :
>> If you know the Locale, then you do know what the decimal separator
>> is, as that is part of what a locale defines.
> I don't know what that sentence means.
When you set the locale
>
>> The issue is that if you just know the encoding, you don't necessarily
>> know the locale.
> I always know my locale. The locale is tied to the human user.
No, it should be tied to the data you are processing. If an English user
is feeding a program Chinese documents, while processing those documents
the program should be using the appropriate Chinese Locale. When
generating output to the user, it should switch (back) to the
appropriate English Locale (likely the system locale that the user set).
>
>> He also commented that he didn't want to set the locale in the
>> routine, as that sets it globally for the full application (but
>> perhaps that latter could be fixed by first doing a
>> locale.getlocale(), then setlocale for the files locale, and then at
>> the end of reading and processing restore back the old locale.
> Setting a locale application-wise is
>
>  * not in accordance with the idea of a locale (the locale should be
>constant within a user session)
Again, no, a locale is tied to the data, not the user (unless you want
to require the user to translate all data to his locale conventions,
without using a program that can use locale information, before
providing it to a program). Yes, the default for the interpretation
should be the user's default/current locale, but you really want them to
be able to say "I got this file from someone whose locale was different
than mine."

Data presented to the user should normally use his locale (unless he has
specified something different).
>
>  * not easily possible (the locale is seen by all threads
>simultaneously)
That is an implementation error. It should be possible to create a
thread-specific locale, and it is really useful to create a local locale
that can be passed to the various conversion operators to say "for this
conversion, use this specific locale", because that is how the data
indicated it is to be interpreted.
>
>
> BTW, I think the locale is a terrible invention.
>
>
> Marko

The locale is a lot better than the alternative, where every application
that needs to deal with internationalization needs to recreate (and
debug) all of the mechanism. I agree it isn't perfect, and for small
simple programs it would be nice to be able to say "I don't want all
this stuff, make it go away".

Python took its locale (at least initially) from C, where it was a single
global, which does have more issues because of this. C++ objectified the
locale and allows the programmer to imbue a specific locale into
different parts of his program (in particular, each I/O stream knows
what locale its data is to be processed with). Perhaps it would be good
for Python to adopt (maybe it already has) the object-based locale
concept of C++, where streams know their locale and other operations can
optionally be passed a locale to use, though that does come at a
significant cost for things like CPython.
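
To make the "pass the conventions along with the operation" idea
concrete, here is a rough Python sketch. NumberFormat and its fields are
hypothetical names invented for this illustration; they are not part of
the standard library and this is not how C++ locales are implemented. It
only shows conventions travelling with the data instead of living in a
process-wide global.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberFormat:
    """Hypothetical per-data conventions, passed explicitly, never global."""
    decimal_sep: str = "."
    group_sep: str = ","

    def parse_float(self, text: str) -> float:
        # Drop grouping marks first, then normalise the decimal mark.
        cleaned = text.strip()
        if self.group_sep:
            cleaned = cleaned.replace(self.group_sep, "")
        return float(cleaned.replace(self.decimal_sep, "."))


english = NumberFormat(decimal_sep=".", group_sep=",")
german = NumberFormat(decimal_sep=",", group_sep=".")

print(english.parse_float("4,800.1"))   # 4800.1
print(german.parse_float("4.800,1"))    # 4800.1
```

Two differently configured objects can be used side by side, even from
different threads, because nothing global is modified.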

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Peter J. Holzer

On 2018-06-23 08:12:52 -0400, Richard Damon wrote:
> On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> > On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
> >> If you know the Locale, then you do know what the decimal separator is,
> >> as that is part of what a locale defines.
> > A locale defines a set of common cultural conventions. It doesn't mandate 
> > the actual conventions in use in any specific document.
> >
> > If I'm in Australia, using the en-AU locale, nevertheless I can generate 
> > a file using , as a decimal separator. Try and stop me :-)
> yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
> One, just as you can misuse the language and write cat when you mean a
> member of the Canine group, but then the misinterpretation is on the
> creator of the document, not on the program that was told how the
> document is to be read.

How would he mis-use the en-AU locale to write 1 as "1,000"? I think
to do that he would simply NOT use the locale.

I think there are very good reasons to ignore the locale for specific
purposes. For example, a Python interpreter should not use the locale
when parsing Python, and a program producing Python should also ignore
the locale.

You two also seem to be writing about different things when you write
"THE locale". Steven seems to mean the global settings a user has
chosen, you seem to mean the specific settings appropriate for parsing a
specific file.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/23/18 7:46 AM, Steven D'Aprano wrote:
> On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:
>
>> If you know the Locale, then you do know what the decimal separator is,
>> as that is part of what a locale defines.
> A locale defines a set of common cultural conventions. It doesn't mandate 
> the actual conventions in use in any specific document.
>
> If I'm in Australia, using the en-AU locale, nevertheless I can generate 
> a file using , as a decimal separator. Try and stop me :-)
yes, you can MIS-use the en-AU locale and write 1,000 to mean the number
One, just as you can misuse the language and write cat when you mean a
member of the Canine group, but then the misinterpretation is on the
creator of the document, not on the program that was told how the
document is to be read.
>
> But your point is taken -- I misread Ethan saying that he knew the locale 
> and it wasn't helping, when in fact he was reluctant to change the locale 
> as that's a process-wide global change.
>
>> The issue is that if you just
>> know the encoding, you don't necessarily know the locale. He also
>> commented that he didn't want to set the locale in the routine, as that
>> sets it globally for the full application (but perhaps that latter could
>> be fixed by first doing a locale.getlocale(), then setlocale for the
>> files locale, and then at the end of reading and processing restore back
>> the old locale.
> Indeed.
>
>

-- 
Richard Damon


Re: translating foreign data

2018-06-23 Thread Marko Rauhamaa

Richard Damon :
> If you know the Locale, then you do know what the decimal separator
> is, as that is part of what a locale defines.

I don't know what that sentence means.

> The issue is that if you just know the encoding, you don't necessarily
> know the locale.

I always know my locale. The locale is tied to the human user.

> He also commented that he didn't want to set the locale in the
> routine, as that sets it globally for the full application (but
> perhaps that latter could be fixed by first doing a
> locale.getlocale(), then setlocale for the files locale, and then at
> the end of reading and processing restore back the old locale.

Setting a locale application-wise is

 * not in accordance with the idea of a locale (the locale should be
   constant within a user session)

 * not easily possible (the locale is seen by all threads
   simultaneously)


BTW, I think the locale is a terrible invention.


Marko

Re: translating foreign data

2018-06-23 Thread Steven D'Aprano

On Sat, 23 Jun 2018 06:26:22 -0400, Richard Damon wrote:

> If you know the Locale, then you do know what the decimal separator is,
> as that is part of what a locale defines.

A locale defines a set of common cultural conventions. It doesn't mandate 
the actual conventions in use in any specific document.

If I'm in Australia, using the en-AU locale, nevertheless I can generate 
a file using , as a decimal separator. Try and stop me :-)

But your point is taken -- I misread Ethan saying that he knew the locale 
and it wasn't helping, when in fact he was reluctant to change the locale 
as that's a process-wide global change.

> The issue is that if you just
> know the encoding, you don't necessarily know the locale. He also
> commented that he didn't want to set the locale in the routine, as that
> sets it globally for the full application (but perhaps that latter could
> be fixed by first doing a locale.getlocale(), then setlocale for the
> files locale, and then at the end of reading and processing restore back
> the old locale.

Indeed.

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-23 Thread Richard Damon

On 6/22/18 11:21 PM, Steven D'Aprano wrote:
> On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:
>
>> Steven D'Aprano  writes:
>>
>>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>>
>>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>>> that are not ASCII digits?
>>>>> Good question.  I have no idea.
>>>> It's much more of an open question than I thought.
>>> Nah, Python already solves that for you:
>> My understanding was that the OP does not (reliably) know the encoding,
>> though that was a guess based on a turn of phrase.
> I took it the other way: that Ethan *does* know the encoding, and his 
> problem is that knowing the encoding and/or locale is not enough to 
> recognise whether to use a period or comma as the decimal separator.
>
> Which it isn't.
If you know the Locale, then you do know what the decimal separator is,
as that is part of what a locale defines. The issue is that if you just
know the encoding, you don't necessarily know the locale. He also
commented that he didn't want to set the locale in the routine, as that
sets it globally for the full application (but perhaps the latter could
be fixed by first doing a locale.getlocale(), then setlocale for the
file's locale, and then at the end of reading and processing restoring
the old locale).
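
A sketch of that save-and-restore idea in Python, under a few
assumptions: temporary_locale is a made-up helper name, and the saved
value is obtained with locale.setlocale(category) (a query when no second
argument is given) rather than locale.getlocale(), because the string it
returns can be handed straight back to setlocale. The change is still
process-wide while the block runs, so other threads see it too.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_NUMERIC):
    """Switch the process-wide locale for one category, then restore it."""
    saved = locale.setlocale(category)      # no second argument: query only
    try:
        locale.setlocale(category, name)    # raises locale.Error if unavailable
        yield
    finally:
        locale.setlocale(category, saved)   # always put the old setting back


with temporary_locale("C"):
    print(locale.atof("10.5"))              # 10.5
```

With a locale such as "de_DE.UTF-8" (assuming it is installed on the
machine), locale.atof("10,5") inside the block would be parsed with a
decimal comma instead.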
>
> If he doesn't know the encoding, he has bigger problems than just 
> converting strings into floats. Without knowing the encoding, he cannot 
> even reliably detect non-ASCII digits at all.
>
>
>> Another guess is that the OP does not have Unicode data.  The term "code
>> page" hints at an 8-bit encoding or at least a pre-Unicode one.
> Assuming he is using Python 3, or using Python 2 sensibly, once he has 
> specified the encoding and read the data from the file, he has Unicode.
>
> Unicode is a superset of (ideally) all code pages. Once you have decoded 
> the data using the appropriate code page, you have a Unicode string, and 
> Python doesn't care where it came from.
>
> The point is, once Ethan can get the intended characters out of the file 
> into Python, it doesn't matter what code page they came from. They're now 
> full-fledged Unicode characters, and Python's float() and int() functions 
> can easily deal with non-ASCII digits. So long as he has digits in the 
> first place, float() and int() will deal with them correctly.
>
>

-- 
Richard Damon


Re: translating foreign data

2018-06-22 Thread Steven D'Aprano

On Fri, 22 Jun 2018 20:06:35 +0100, Ben Bacarisse wrote:

> Steven D'Aprano  writes:
> 
>> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>>
>>>>> The code page remark is curious.  Will some "code pages" have digits
>>>>> that are not ASCII digits?
>>>>
>>>> Good question.  I have no idea.
>>> 
>>> It's much more of an open question than I thought.
>>
>> Nah, Python already solves that for you:
> 
> My understanding was that the OP does not (reliably) know the encoding,
> though that was a guess based on a turn of phrase.

I took it the other way: that Ethan *does* know the encoding, and his 
problem is that knowing the encoding and/or locale is not enough to 
recognise whether to use a period or comma as the decimal separator.

Which it isn't.

If he doesn't know the encoding, he has bigger problems than just 
converting strings into floats. Without knowing the encoding, he cannot 
even reliably detect non-ASCII digits at all.

> Another guess is that the OP does not have Unicode data.  The term "code
> page" hints at an 8-bit encoding or at least a pre-Unicode one.

Assuming he is using Python 3, or using Python 2 sensibly, once he has 
specified the encoding and read the data from the file, he has Unicode.

Unicode is a superset of (ideally) all code pages. Once you have decoded 
the data using the appropriate code page, you have a Unicode string, and 
Python doesn't care where it came from.

The point is, once Ethan can get the intended characters out of the file 
into Python, it doesn't matter what code page they came from. They're now 
full-fledged Unicode characters, and Python's float() and int() functions 
can easily deal with non-ASCII digits. So long as he has digits in the 
first place, float() and int() will deal with them correctly.

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-22 Thread Ben Bacarisse

Steven D'Aprano  writes:

> On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:
>
>>>> The code page remark is curious.  Will some "code pages" have digits
>>>> that are not ASCII digits?
>>>
>>> Good question.  I have no idea.
>> 
>> It's much more of an open question than I thought.
>
> Nah, Python already solves that for you:

My understanding was that the OP does not (reliably) know the encoding,
though that was a guess based on a turn of phrase.

Another guess is that the OP does not have Unicode data.  The term "code
page" hints at an 8-bit encoding or at least a pre-Unicode one.

-- 
Ben.

Re: translating foreign data

2018-06-22 Thread Richard Damon

On 6/22/18 4:43 AM, Ethan Furman wrote:
> On 06/21/2018 01:20 PM, Ben Bacarisse wrote:
>
>> The code page remark is curious.  Will some "code pages" have digits
>> that are not ASCII digits?
>
> Good question.  I have no idea.  I get the appropriate decoder/encoder
> based on the code page contained in the file, then decode to unicode
> and go from there.  Unfortunately, that doesn't convert the decimal
> comma to the decimal point. :(  So I was hoping to map the code page
> to a locale that would properly translate the numbers for me, but so
> far what I have found in my readings suggests that in order to use the
> locale option I would have to actually change the active locale and
> potentially mess up every other part of the program when the file in
> question is opened in a locale that's different from its code page.
>
> Worst case scenario is I manually create a map for each code page to
> decimal separator, but there's more than a few and I'd rather not if
> there is already a prebuilt solution out there.
>
> -- 
> ~Ethan~
>
One problem is that the code page does NOT uniquely define what decimal
separator to use (or what locale to use). You can get the decimal
separator issue even on files that are pure ASCII, and Latin-1 is full
of the issue too.

-- 
Richard Damon


Re: translating foreign data

2018-06-22 Thread Steven D'Aprano

On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:

>>> The code page remark is curious.  Will some "code pages" have digits
>>> that are not ASCII digits?
>>
>> Good question.  I have no idea.
> 
> It's much more of an open question than I thought.

Nah, Python already solves that for you:

py> import unicodedata
py> s = "১২৩৪৫.৬৭৮৯০"
py> for c in s:
...     print(unicodedata.name(c))
...
BENGALI DIGIT ONE
BENGALI DIGIT TWO
BENGALI DIGIT THREE
BENGALI DIGIT FOUR
BENGALI DIGIT FIVE
FULL STOP
BENGALI DIGIT SIX
BENGALI DIGIT SEVEN
BENGALI DIGIT EIGHT
BENGALI DIGIT NINE
BENGALI DIGIT ZERO
py> float(s)
12345.6789

Further to my earlier post, if you call:

for sep in ",\u00B7\u066B":
    mystring = mystring.replace(sep, '.')

before passing it to float, that ought to cover just about anything you 
will find in real-world data regardless of language. If Ethan finds 
something that isn't covered by those three cases (comma, middle dot and 
Arabic decimal separator) he'll likely need to consult an expert on that 
language.

Provided Ethan doesn't have to deal with thousands separators as well. 
Then it gets complicated.

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-22 Thread Ben Bacarisse

Ethan Furman  writes:

> On 06/21/2018 01:20 PM, Ben Bacarisse wrote:

>> You say in a followup that you don't need to worry about digit grouping
>> marks (like thousands separators) so I'm not sure what the problem is.
>> Can't you just replace ',' with '.' a proceed as if you had only one
>> representation?
>
> I could, and that would work right up until a third decimal separator
> was found.  I'd like to solve the problem just once if possible.

Ah, I see.  I took you to mean you knew this won't be an issue.

>> The code page remark is curious.  Will some "code pages" have digits
>> that are not ASCII digits?
>
> Good question.  I have no idea.

It's much more of an open question than I thought.  My only advice,
then, is to ignore problems that *might* arise.  Solve the problem you
face now and hope that you can extend it as needed.  It's good to check
if there is a well-known solution ready to use out of the box, but
since there really isn't, you might as well get something working now.

> I get the appropriate decoder/encoder
> based on the code page contained in the file, then decode to unicode
> and go from there.

It's rather off-topic but what does it mean for the code page to be
contained in the file?  Are you guessing the character encoding from the
rest of the file contents or is there some actual description of the
encoding present?

> ... I was hoping to map the code page to
> a locale that would properly translate the numbers for me,

> Worst case scenario is I manually create a map for each code page to
> decimal separator, but there's more than a few and I'd rather not if
> there is already a prebuilt solution out there.

That can't work in general, but you may be lucky with your particular
data set.  For example, files using one of the "Latin" encodings could
have numbers written using the UK convention (0.5) or the French
convention (0,5).  I do both depending on the audience.

-- 
Ben.

Re: translating foreign data

2018-06-22 Thread Steven D'Aprano

On Fri, 22 Jun 2018 01:43:56 -0700, Ethan Furman wrote:

>> You say in a followup that you don't need to worry about digit grouping
>> marks (like thousands separators) so I'm not sure what the problem is.
>> Can't you just replace ',' with '.' a proceed as if you had only one
>> representation?
> 
> I could, and that would work right up until a third decimal separator
> was found.  I'd like to solve the problem just once if possible.

I don't know of any already existing solution, but there's only a limited 
number of decimal separators in common use around the world. There's 
probably nothing you can do ahead of time if somebody decides to start 
using (say) 5 as a decimal separator within Hindi numerals, except cry, 
but you can probably start by transforming all of the following into 
decimal points:

- interpunct (middle dot) · U+00B7
- comma ,
- Arabic decimal separator ٫ U+066B

https://en.wikipedia.org/wiki/Decimal_separator

Those three cover pretty much the whole world, using Hindu-Arabic 
numerals (1234...) and Eastern Arabic numerals (what the Arabs and 
Persians use). Other numeral systems seem to have either adopted Arabic 
numerals, or introduced the decimal point/comma into their own numeral 
system, or just don't use a decimal place value system.

Either way, I expect that the period . plus the three above will cover 
anything you are likely to find in real data.
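
One way to write that normalisation down, as a sketch only: to_float and
DECIMAL_MARKS are names chosen for this example, and the input is assumed
to contain no thousands separators. Python's float() already accepts
non-ASCII decimal digits, so only the decimal mark needs mapping.

```python
# Map the three alternative decimal marks to an ASCII full stop.
DECIMAL_MARKS = str.maketrans({",": ".", "\u00B7": ".", "\u066B": "."})


def to_float(text: str) -> float:
    """Parse a number that has no grouping marks, whatever its decimal mark."""
    return float(text.strip().translate(DECIMAL_MARKS))


print(to_float("10,5"))                        # 10.5
print(to_float("10\u00B75"))                   # 10.5 (middle dot)
print(to_float("\u0661\u0660\u066B\u0665"))    # 10.5 (Eastern Arabic digits)
```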

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


Re: translating foreign data

2018-06-22 Thread Ethan Furman

On 06/21/2018 01:20 PM, Ben Bacarisse wrote:

> Ethan Furman writes:
>
>> I need to translate numeric data in a string format into a binary
>> format. I know there are at least two different methods of
>> representing parts less that 1, such as "10.5" and "10,5". The data
>> is encoded using code pages, and can vary depending on the file being
>> read (so I can't rely on current locale settings).
>>
>> I'm sure this is a solved problem, but I'm not finding those
>> solutions. Any pointers?
>
> You say "at least two" and give two but not knowing the others will hamper
> anyone trying to help. (I appreciate that you may not yet know if there
> are to be any others.)

Yes, I don't know if there are others -- I have not studied the various
ways different peoples represent decimal numbers. ;)

> You say in a followup that you don't need to worry about digit grouping
> marks (like thousands separators) so I'm not sure what the problem is.
> Can't you just replace ',' with '.' a proceed as if you had only one
> representation?

I could, and that would work right up until a third decimal separator was
found. I'd like to solve the problem just once if possible.

> The code page remark is curious. Will some "code pages" have digits
> that are not ASCII digits?

Good question. I have no idea. I get the appropriate decoder/encoder
based on the code page contained in the file, then decode to unicode and
go from there. Unfortunately, that doesn't convert the decimal comma to
the decimal point. :( So I was hoping to map the code page to a locale
that would properly translate the numbers for me, but so far what I have
found in my readings suggests that in order to use the locale option I
would have to actually change the active locale and potentially mess up
every other part of the program when the file in question is opened in a
locale that's different from its code page.

Worst case scenario is I manually create a map for each code page to
decimal separator, but there's more than a few and I'd rather not if
there is already a prebuilt solution out there.

--
~Ethan~


Re: translating foreign data

2018-06-21 Thread Cameron Simpson


On 21Jun2018 10:12, Ethan Furman  wrote:
> I need to translate numeric data in a string format into a binary
> format.  I know there are at least two different methods of
> representing parts less that 1, such as "10.5" and "10,5".  The data
> is encoded using code pages, and can vary depending on the file being
> read (so I can't rely on current locale settings).
>
> I'm sure this is a solved problem, but I'm not finding those solutions.
> Any pointers?


It sounds like you're conflating two problems:

- the file character data encoding

- the numeric representation

Can't you just read the file as a text file using the correct 
codepage->decoding setting to get strings, _then_ parse numbers either with 
some clunky regexp based approach or some flexible external library for common 
numeric forms? (Someone suggested babel, I've never used it.)
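
A rough sketch of that two-step split, using the clunky-regexp route.
Everything here is illustrative: numbers_from_bytes and NUMBER are
made-up names, cp1252 stands in for whatever code page the file declares,
and the pattern assumes there are no thousands separators.

```python
import re

# Optional sign, digits, then optionally one '.' or ',' followed by digits.
NUMBER = re.compile(r"[+-]?\d+(?:[.,]\d+)?")


def numbers_from_bytes(raw: bytes, encoding: str) -> list:
    """Step 1: decode with the file's code page. Step 2: parse the numbers."""
    text = raw.decode(encoding)
    return [float(m.group().replace(",", ".")) for m in NUMBER.finditer(text)]


print(numbers_from_bytes("10,5 € / 7.25 €".encode("cp1252"), "cp1252"))
# [10.5, 7.25]
```

The decoding step knows nothing about numbers and the parsing step knows
nothing about code pages, which is the point of keeping them apart.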


Cheers,
Cameron Simpson 

Re: translating foreign data

2018-06-21 Thread Gregory Ewing


George Fischhof wrote:


> - if you found only one type, then that is the decimal


Only if you're sure that all numbers contain a decimal separator.
Otherwise there's no way to be sure in general.

--
Greg

Re: translating foreign data

2018-06-21 Thread George Fischhof

Peter Otten <__pete...@web.de> wrote (on Thu, 21 Jun 2018, 22:45):

> Ethan Furman wrote:
>
> > I need to translate numeric data in a string format into a binary
> > format. I know there are at least two different methods of
> > representing parts less that 1, such as "10.5" and "10,5". The data
> > is encoded using code pages, and can vary depending on the file being
> > read (so I can't rely on current locale settings).
> >
> > I'm sure this is a solved problem, but I'm not finding those solutions.
> > Any pointers?
>
> There's babel
>
> http://babel.pocoo.org/en/latest/numbers.html#parsing-numbers
>
> though I'm not sure what to think of the note below the linked paragraph.


Hi,

if you have several values in a file, then you can probably check the
delimiters. There is only one decimal separator per number, so:
- if you find a number with two different separators, then the rightmost
  one is the decimal separator
- if you find only one type, then that is the decimal separator
- try to check the separators from right to left
- if you find 4 digits to the right of a separator, then that is the
  decimal separator
etc. (Maybe Wikipedia should be checked for other separators.)
Other thousands separators in use: space, apostrophe, and in India, after
the first thousands separator, the digits are grouped in twos, not threes.

And if you are able to identify the encoding codepage, then you should
follow what the codepage says.

Another help can be if you know the possible value range of the numbers
(maybe it should be asked ...)
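
Some of these rules can be sketched in Python. This is a guess-only
heuristic under stated assumptions, not a reliable parser:
guess_decimal_separator and SEPARATORS are names invented here, the input
is a single bare number, and one deliberate deviation is that a lone
separator followed by exactly three digits is reported as undecidable
(None), since "1,000" could be one thousand or one.

```python
SEPARATORS = ".,' "


def guess_decimal_separator(text):
    """Guess the decimal mark in one number; None means 'none or can't tell'."""
    number = text.strip()
    found = [ch for ch in number if ch in SEPARATORS]
    if not found:
        return None                  # plain integer: nothing to learn
    if len(set(found)) > 1:
        return found[-1]             # two kinds: the rightmost is the decimal
    if len(found) > 1:
        return None                  # one kind, repeated: grouping marks only
    digits_after = len(number.rsplit(found[0], 1)[1])
    # Exactly three trailing digits is ambiguous; anything else is a decimal.
    return None if digits_after == 3 else found[0]


for sample in ("1.234,56", "1,234.56", "10,5", "1 234 567", "1,000"):
    print(repr(sample), "->", guess_decimal_separator(sample))
```

Running this over every number in a file and tallying the non-None
answers would give a file-level guess, which is where knowing the
plausible value range helps break the remaining ties.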


George

Re: translating foreign data

2018-06-21 Thread Peter Otten

Ethan Furman wrote:

> I need to translate numeric data in a string format into a binary format. 
> I know there are at least two different
> methods of representing parts less that 1, such as "10.5" and "10,5".  The
> data is encoded using code pages, and can vary depending on the file being
> read (so I can't rely on current locale settings).
> 
> I'm sure this is a solved problem, but I'm not finding those solutions. 
> Any pointers?

There's babel

http://babel.pocoo.org/en/latest/numbers.html#parsing-numbers

though I'm not sure what to think of the note below the linked paragraph.



Re: translating foreign data

2018-06-21 Thread Ben Bacarisse

Ethan Furman  writes:

> I need to translate numeric data in a string format into a binary
> format.  I know there are at least two different methods of
> representing parts less that 1, such as "10.5" and "10,5".  The data
> is encoded using code pages, and can vary depending on the file being
> read (so I can't rely on current locale settings).
>
> I'm sure this is a solved problem, but I'm not finding those
> solutions.  Any pointers?

You say "at least two" and give two but not knowing the others will hamper
anyone trying to help.  (I appreciate that you may not yet know if there
are to be any others.)

You say in a followup that you don't need to worry about digit grouping
marks (like thousands separators) so I'm not sure what the problem is.
Can't you just replace ',' with '.' and proceed as if you had only one
representation?

The code page remark is curious.  Will some "code pages" have digits
that are not ASCII digits?

-- 
Ben.

Re: translating foreign data

2018-06-21 Thread Ethan Furman


On 06/21/2018 12:07 PM, codewiz...@gmail.com wrote:
> On Thursday, June 21, 2018 at 1:08:35 PM UTC-4, Ethan Furman wrote:
>> I need to translate numeric data in a string format into a binary format.
>> I know there are at least two different methods of representing parts less
>> than 1, such as "10.5" and "10,5".  The data is encoded using code pages,
>> and can vary depending on the file being read (so I can't rely on current
>> locale settings).
>>
>> I'm sure this is a solved problem, but I'm not finding those solutions.
>> Any pointers?
>
> Try this StackOverflow answer: https://stackoverflow.com/a/17815252


--> import locale
--> a = u'545,545.'
--> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'

The problem there is that it sets the locale for the entire process -- I
just need the conversion step for individual pieces of data without
modifying the user's settings.
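
One way to get a per-value conversion without calling locale.setlocale()
is to pass the decimal mark explicitly. This is only a sketch under the
assumption that each file's decimal mark is known (or can be looked up
from the file's metadata); parse_number and its decimal_mark parameter are
hypothetical names, not part of any library.

```python
def parse_number(text, decimal_mark="."):
    """Convert `text` to float using the given decimal mark.

    No global state is touched, so the user's locale settings are left
    alone.  The other mark is rejected rather than guessed at.
    """
    if decimal_mark not in (".", ","):
        raise ValueError(f"unsupported decimal mark: {decimal_mark!r}")
    text = text.strip()
    other_mark = "," if decimal_mark == "." else "."
    if other_mark in text:
        raise ValueError(f"unexpected {other_mark!r} in {text!r}")
    return float(text.replace(decimal_mark, "."))


print(parse_number("10.5"))                    # 10.5
print(parse_number("10,5", decimal_mark=","))  # 10.5
```

The caller decides per file which mark applies, so two files with
different conventions can be read in the same process.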


--
~Ethan~

Re: translating foreign data

2018-06-21 Thread Ethan Furman


On 06/21/2018 10:36 AM, Peter Pearson wrote:
> On Thu, 21 Jun 2018 10:12:27 -0700, Ethan Furman wrote:
>> I need to translate numeric data in a string format into a binary
>> format.  I know there are at least two different methods of
>> representing parts less than 1, such as "10.5" and "10,5".  The data
>> is encoded using code pages, and can vary depending on the file being
>> read (so I can't rely on current locale settings).
>>
>> I'm sure this is a solved problem, but I'm not finding those
>> solutions.  Any pointers?
>
> Do you also have to accommodate the possibility that one thousand
> might be written "1,000" or "1.000"?


Nope, just the decimal character.

--
~Ethan~



Re: translating foreign data

2018-06-21 Thread codewizard

On Thursday, June 21, 2018 at 1:08:35 PM UTC-4, Ethan Furman wrote:
> I need to translate numeric data in a string format into a binary format.
> I know there are at least two different methods of representing parts less
> than 1, such as "10.5" and "10,5".  The data is encoded using code pages,
> and can vary depending on the file being read (so I can't rely on current
> locale settings).
> 
> I'm sure this is a solved problem, but I'm not finding those solutions.
> Any pointers?
> 
> --
> ~Ethan~

Try this StackOverflow answer: https://stackoverflow.com/a/17815252

Regards,
Igor.

Re: translating foreign data

2018-06-21 Thread Peter Pearson

On Thu, 21 Jun 2018 10:12:27 -0700, Ethan Furman  wrote:
> I need to translate numeric data in a string format into a binary
> format.  I know there are at least two different methods of
> representing parts less than 1, such as "10.5" and "10,5".  The data
> is encoded using code pages, and can vary depending on the file being
> read (so I can't rely on current locale settings).
>
> I'm sure this is a solved problem, but I'm not finding those
> solutions.  Any pointers?

Do you also have to accommodate the possibility that one thousand
might be written "1,000" or "1.000"?

-- 
To email me, substitute nowhere->runbox, invalid->com.

translating foreign data

2018-06-21 Thread Ethan Furman

I need to translate numeric data in a string format into a binary format.
I know there are at least two different methods of representing parts less
than 1, such as "10.5" and "10,5".  The data is encoded using code pages,
and can vary depending on the file being read (so I can't rely on current
locale settings).

I'm sure this is a solved problem, but I'm not finding those solutions.
Any pointers?

--
~Ethan~
