Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Eliot Miranda
Hi Guille,

> On Jan 18, 2019, at 6:04 AM, Guillermo Polito  
> wrote:
> 
>> On Fri, Jan 18, 2019 at 2:46 PM Ben Coman  wrote:
>> 
>>> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe  wrote:
>>> 
>>> > On 18 Jan 2019, at 14:23, Guillermo Polito  
>>> > wrote:
>>> > 
>>> > 
>>> > I think that will just overcomplicate things. Right now, all Strings in 
>>> > Pharo are unicode strings.
>> 
>> Cool. I didn't realise that.  But to be pedantic, which unicode encoding? 
>> Should I presume from Sven's "UTF-8 encoding step" comment below 
>> and the WideString class comment  "This class represents the array of 32 bit 
>> wide characters"
>> that the WideString encoding is UTF-32?  So should its comment be updated to 
>> advise that?
> 
> None :D
> 
> That's the funny thing, they are not encoded.
> 
> Actually, you should see Strings as collections of Characters, and Characters 
> defined in terms of their abstract code points.
> ByteStrings are an optimized (just more compact) version that stores 
> codepoints that fit in a byte.

And Spur supports 16-bit strings too, which would be versions that store code 
points that fit in doublebytes.
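
A quick playground check of this (a minimal sketch; class names as in current Pharo, the characters are only examples):

    $a asInteger.              "97 - the Unicode code point of the Character"
    'abc' class.               "ByteString - all code points fit in a byte"
    'über' class.              "ByteString - ü is U+00FC, still < 256"
    '€10' class.               "WideString - € is U+20AC, does not fit in a byte"
    ('€10' at: 1) asInteger.   "8364 - a code point, not encoded bytes"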

>> cheers -ben
>> 
>>> Characters are represented with their corresponding unicode codepoint.
>>> > If all characters in a string have codepoints < 256 then they are just 
>>> > stored in a bytestring. Otherwise they are WideStrings.
>>> > 
>>> > I think assuming a single representation for strings, and then encode 
>>> > when interacting with external apps/APIs is MUCH simpler.
>>> 
>>> Absolutely !
>>> 
>>> (and yes I know that for outgoing FFI calls that might mean a UTF-8 
>>> encoding step, so be it).
> 
> 
> -- 
>
> Guille Polito
> Research Engineer
> Centre de Recherche en Informatique, Signal et Automatique de Lille
> CRIStAL - UMR 9189
> French National Center for Scientific Research - http://www.cnrs.fr
> 
> Web: http://guillep.github.io
> Phone: +33 06 52 70 66 13


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Eliot Miranda


> On Jan 18, 2019, at 2:04 AM, Guillermo Polito  
> wrote:
[snip]
> 
> Well, personally I would like that getenv/setenv and getcwd setcwd support 
> are not in a plugin but as a basic service provided by the vm.

+1000
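
For reference, the image-side entry point this feeds into is along the lines of (current Pharo API, shown only as a sketch; how the answered values get decoded is exactly what this thread is about):

    Smalltalk os environment at: 'PATH'.
    Smalltalk os environment at: 'LANG' ifAbsent: [ nil ].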

> Cheers,
> Guille



[Pharo-dev] literate programming

2019-01-18 Thread Nicolai Hess
Physically Based Rendering
an interesting book, written using literate programming, available as a free book:
http://www.pbr-book.org/


Re: [Pharo-dev] [ANN] New stable VM version.

2019-01-18 Thread Esteban Lorenzano


> On 18 Jan 2019, at 15:57, Sean P. DeNigris via Pharo-dev 
>  wrote:
> 
> 
> From: "Sean P. DeNigris" 
> Subject: Re: [ANN] New stable VM version.
> Date: 18 January 2019 at 15:57:12 CET
> To: pharo-dev@lists.pharo.org
> 
> 
> EstebanLM wrote
>> Enjoy
> 
> Thanks! Which image versions does this affect? I want to clear my Launcher
> caches…

Pharo 7.0 and Pharo 8.0 images.

> 
> 
> 
> -
> Cheers,
> Sean
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
> 
> 
> 



[Pharo-dev] [ Save The Date ] Pharo Days 2019: Thursday April 4 & Friday April 5 @ Lille, FR

2019-01-18 Thread Sven Van Caekenberghe


[ Save The Date ] Pharo Days 2019: Thursday April 4 & Friday April 5 @ Lille, FR


Dear members of the Pharo community,


We are happy to announce that we will be organising Pharo Days this year, in 
Lille, France. This will be a two day event: Thursday April 4 & Friday April 5. 
The main venue will be the Amphitheatre of INRIA Lille - Nord Europe.


Each day will consist of a number of short 20 to 30 minute tech talks in the 
morning, with a freer format in the afternoon: general hacking space, pair 
programming, demos, side meetings, tutorials, coding sprints, Q&As, show us 
your projects - real-life human interaction. Of course, there will be social 
events as well.


Please join us to make this another successful edition, after Annecy (FR) 2011, 
Lille 2012, Bern (CH) & Lille 2013, Lille 2015, Namur (BE) 2016 and Lille 2017.


Mark your calendars, we will provide more details when they become available.


The Pharo Board, Association & Consortium 

http://pharo.org
http://association.pharo.org
http://consortium.pharo.org




Re: [Pharo-dev] [ANN] New stable VM version.

2019-01-18 Thread Sean P. DeNigris via Pharo-dev
--- Begin Message ---
EstebanLM wrote
> Enjoy

Thanks! Which image versions does this affect? I want to clear my Launcher
caches…



-
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html

--- End Message ---


Re: [Pharo-dev] [ANN] New stable VM version.

2019-01-18 Thread Alistair Grant
Hi Esteban,

On Fri, 18 Jan 2019 at 14:49, Esteban Lorenzano  wrote:
>
> Hi,
>
> Yes, I needed to promote a new stable version because last week’s was having 
> a bug (which is not fixed).
> This one is based on the build:
>
> 201901172323-5a38b34

On Linux (Ubuntu 16.04) it's still the old VM:

$ curl get.pharo.org/64/vm70 | bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5320  100  5320     0     0  37021      0 --:--:-- --:--:-- --:--:-- 37202
Downloading the latest pharoVM:
http://files.pharo.org/get-files/70/pharo64-linux-stable.zip
pharo-vm/pharo
Creating starter scripts pharo and pharo-ui


$ ./pharo --version
5.0-201901051900  Sat Jan  5 19:12:50 UTC 2019 gcc 4.8 [Production
Spur 64-bit VM]
CoInterpreter VMMaker.oscog-eem.2504 uuid:
a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan  5 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2504 uuid:
a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan  5 2019
VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b64
Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Linux travis-job-f22c8934-2412-48ed-8180-7a42b62c7389
4.4.0-101-generic #124~14.04.1-Ubuntu SMP Fri Nov 10 19:05:36 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux
plugin path: /dev/shm/vm/pharo-vm/lib/pharo/5.0-201901051900 [default:
/dev/shm/vm/pharo-vm/lib/pharo/5.0-201901051900/]


Cheers,
Alistair


> Enjoy (and please report problems)
>
> Esteban



Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---
>> 
>> On my TODO is to make it stand-alone and provide it as a “compatibility 
>> transform”, too.
> 
> I have to dig because I remember that I went over all the deprecations in 
> Pharo 60 and started to look at the ones that I 
> could “transformify” so that we get a nice package that rewrites more :)
> 
> 
>> So we can add it to methods that we want to keep for compatibility, but they 
>> will nevertheless transform the code automatically.
>> (this then might be disabled in production to not transform)
> 
> Yes, I like that. I will look for my code. 

Apparently I published under 

MCSmalltalkhubRepository
	owner: 'PharoExtras'
	project: 'Migrator'

Several packages: 
MigratorPharo60 contains what I did for Pharo 60, 
and Migrator probably contains some of the Pharo 70 ones. 

Now I would like to understand how to organise it. 

I have the impression that we should keep the ones that can be transformed.

I do not think that we should go back to Pharo 50 because it is too old. 

Stef




--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---

>>> I think that will just overcomplicate things. Right now, all Strings in 
>>> Pharo are unicode strings.
>> 
>> Cool. I didn't realise that.  But to be pedantic, which unicode encoding? 
>> Should I presume from Sven's "UTF-8 encoding step" comment below 
>> and the WideString class comment  "This class represents the array of 32 bit 
>> wide characters"
>> that the WideString encoding is UTF-32?  So should its comment be updated to 
>> advise that?
> 
> Not really, Pharo Strings are a collection of Characters, each of which is a 
> Unicode code point (yes a 32 bit one).
> 
> An encoding projects this rather abstract notion onto a sequence of bytes,
> 
> UTF-32 (ZnUTF32Encoder, https://en.wikipedia.org/wiki/UTF-32 
> ) is for example endian dependent.
> 
> Read the first part of
> 
> https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
>  
> 

I love that book :)

It is so cool to have such cool docs.
--- End Message ---


Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---
Hi marcus

>> I simply love the dynamic rewriting this is just too cool. We should 
>> systematically use it. 
>> I will continue to use it in any deprecation. 
>> 
> 
> On my TODO is to make it stand-alone and provide it as a “compatibility 
> transform”, too.

I have to dig because I remember that I went over all the deprecations in Pharo 
60 and started to look at the ones that I 
could “transformify” so that we get a nice package that rewrites more :)


> So we can add it to methods that we want to keep for compatibility, but they 
> will nevertheless transform the code automatically.
> (this then might be disabled in production to not transform)

Yes, I like that. I will look for my code. 
> 
>> Now I have a simple question (You can explain it to me over lunch one of 
>> these days).
>> 
>> I do not get why RBAST would not be a good representation for the compiler?
>> I would like to know what is the difference.
>> 
> I think it is a good one. I have not yet seen a reason why not. But remember, 
> Roel left Squeak because his visitor pattern for the compiler was rejected as 
> a dumb idea… so there are definitely different views on core questions.
> 
> E.g. the RB AST is annotated and the whole thing for sure uses a bit more 
> memory than a compiler designed for a machine from 1978.
> 
>> You mean that before going from BC to AST was difficult?
> 
> You need to do the mapping somehow, the compiler needs to remember the BC 
> offset in the code generation phase and the AST (somehow) needs to store that 
> information (either in every node or some table).
> 
>> How does Opal perform it? It does not use the source of the method to recreate 
>> the AST, but can it do it from the BC?
>> 
> 
> It uses the IR (which I still am not 100% sure about, it came from the old 
> “ClosureCompiler” Design and it turned out to be quite useful, for example 
> for the mapping: every IR node retains the offset of the BC it creates, then 
> the IR Nodes
> retain the AST node that created them. 
> 
> -> so we just do a query: “IRMethod, give me the IRInstruction that created 
> BC offset X. then “IR, which AST node did create you? then the AST Node: what 
> is your highlight interval in the source?
> 
> The devil is in the detail as one IR can produce multiple byte code offsets 
> (and byte codes) and one byte code might be created by two IR nodes, but it 
> does seem to work with some tricks. 

ok I see.
And all in all it is really nice. 

> Which I want to remove by improving the mapping and even the IR more… there 
> is even the question: do we need the IR? could we not do it simpler? 
> 
> The IR was quite nice back when we tried to do things with byte code 
> manipulation (Bytesurgeon), now it feels a bit of an overkill. But it 
> simplifies e.g. the bc mapping.


We should document this because this is cool. 

Stef



--- End Message ---


[Pharo-dev] [Pharo 7.0] Build #131: 22887-TonelWritercreateDefaultOrganizationFrom-can-return-string

2019-01-18 Thread ci-pharo-ci-jenkins2--- via Pharo-dev
--- Begin Message ---
There is a new Pharo build available!
  
The status of the build #131 was: SUCCESS.

The Pull Request #2267 was integrated: 
"22887-TonelWritercreateDefaultOrganizationFrom-can-return-string"
Pull request url: https://github.com/pharo-project/pharo/pull/2267

Issue Url: https://pharo.fogbugz.com/f/cases/22887
Build Url: 
https://ci.inria.fr/pharo-ci-jenkins2/job/Test%20pending%20pull%20request%20and%20branch%20Pipeline/job/Pharo7.0/131/
--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Sven Van Caekenberghe



> On 18 Jan 2019, at 14:45, Ben Coman  wrote:
> 
> 
> 
> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe  wrote:
> 
> 
> > On 18 Jan 2019, at 14:23, Guillermo Polito  
> > wrote:
> > 
> > 
> > I think that will just overcomplicate things. Right now, all Strings in 
> > Pharo are unicode strings.
> 
> Cool. I didn't realise that.  But to be pedantic, which unicode encoding? 
> Should I presume from Sven's "UTF-8 encoding step" comment below 
> and the WideString class comment  "This class represents the array of 32 bit 
> wide characters"
> that the WideString encoding is UTF-32?  So should its comment be updated to 
> advise that?

Not really, Pharo Strings are a collection of Characters, each of which is a 
Unicode code point (yes a 32 bit one).

An encoding projects this rather abstract notion onto a sequence of bytes,

UTF-32 (ZnUTF32Encoder, https://en.wikipedia.org/wiki/UTF-32) is for example 
endian dependent.

Read the first part of

https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
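
A minimal round trip with the Zinc encoders (ZnUTF8Encoder and friends ship with Pharo; the sample string is arbitrary):

    ZnUTF8Encoder new encodeString: 'héllo'.                    "#[104 195 169 108 108 111]"
    ZnUTF8Encoder new decodeBytes: #[104 195 169 108 108 111].  "'héllo'"
    'héllo' utf8Encoded.                                         "convenience, same bytes"
    #[104 195 169 108 108 111] utf8Decoded.                      "'héllo'"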

> cheers -ben
> 
> Characters are represented with their corresponding unicode codepoint.
> > If all characters in a string have codepoints < 256 then they are just 
> > stored in a bytestring. Otherwise they are WideStrings.
> > 
> > I think assuming a single representation for strings, and then encode when 
> > interacting with external apps/APIs is MUCH simpler.
> 
> Absolutely !
> 
> (and yes I know that for outgoing FFI calls that might mean a UTF-8 encoding 
> step, so be it).




Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Guillermo Polito
On Fri, Jan 18, 2019 at 2:46 PM Ben Coman  wrote:

>
>
> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe  wrote:
>
>>
>>
>> > On 18 Jan 2019, at 14:23, Guillermo Polito 
>> wrote:
>> >
>> >
>> > I think that will just overcomplicate things. Right now, all Strings in
>> Pharo are unicode strings.
>
>
> Cool. I didn't realise that.  But to be pedantic, which unicode encoding?
> Should I presume from Sven's "UTF-8 encoding step" comment below
> and the WideString class comment  "This class represents the array of 32
> bit wide characters"
> that the WideString encoding is UTF-32?  So should its comment be updated
> to advise that?
>

None :D

That's the funny thing, they are not encoded.

Actually, you should see Strings as collections of Characters, and
Characters defined in terms of their abstract code points.
ByteStrings are an optimized (just more compact) version that stores
codepoints that fit in a byte.


> cheers -ben
>
> Characters are represented with their corresponding unicode codepoint.
>> > If all characters in a string have codepoints < 256 then they are just
>> stored in a bytestring. Otherwise they are WideStrings.
>> >
>> > I think assuming a single representation for strings, and then encode
>> when interacting with external apps/APIs is MUCH simpler.
>>
>> Absolutely !
>>
>> (and yes I know that for outgoing FFI calls that might mean a UTF-8
>> encoding step, so be it).
>>
>

-- 



Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr

Web: http://guillep.github.io
Phone: +33 06 52 70 66 13


Re: [Pharo-dev] [ANN] New stable VM version.

2019-01-18 Thread Guillermo Polito
On Fri, Jan 18, 2019 at 2:49 PM Esteban Lorenzano 
wrote:

> Hi,
>
> Yes, I needed to promote a new stable version because last week’s was
> having a bug (which is not fixed).
>

not? now? :)

This version includes the following:

https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/355
https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/352
https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/354


> This one is based on the build:
>
> 201901172323-5a38b34
>
> Enjoy (and please report problems)
>
> Esteban
>


-- 



Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr

Web: http://guillep.github.io
Phone: +33 06 52 70 66 13


Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread Nicolas Cellier
Le ven. 18 janv. 2019 à 14:42, Marcus Denker  a
écrit :

>
>
> > On 18 Jan 2019, at 14:26, ducasse  wrote:
> >
> > I simply love the dynamic rewriting this is just too cool. We should
> systematically use it.
> > I will continue to use it in any deprecation.
> >
>
> On my TODO is to make it stand-alone and provide it as a “compatibility
> transform”, too.
>
> So we can add it to methods that we want to keep for compatibility, but
> they will nevertheless transform the code automatically.
> (this then might be disabled in production to not transform)
>
> > Now I have a simple question (You can explain it to me over lunch one of
> these days).
> >
> > I do not get why RBAST would not be a good representation for the
> compiler?
> > I would like to know what is the difference.
> >
> I think it is a good one. I have not yet seen a reason why not. But
> remember, Roel left Squeak because his visitor pattern for the compiler was
> rejected as a dumb idea… so there are definitely different views on core
> questions.
>
> E.g. the RB AST is annotated and the whole thing for sure uses a bit more
> memory than a compiler designed for a machine from 1978.
>
> > You mean that before going from BC to AST was difficult?
>
> You need to do the mapping somehow, the compiler needs to remember the BC
> offset in the code generation phase and the AST (somehow) needs to store
> that information (either in every node or some table).
>
> > How does Opal perform it? It does not use the source of the method to
> recreate the AST, but can it do it from the BC?
> >
>
> It uses the IR (which I still am not 100% sure about, it came from the old
> “ClosureCompiler” Design and it turned out to be quite useful, for example
> for the mapping: every IR node retains the offset of the BC it creates,
> then the IR Nodes
> retain the AST node that created them.
>
> -> so we just do a query: “IRMethod, give me the IRInstruction that
> created BC offset X. then “IR, which AST node did create you? then the AST
> Node: what is your highlight interval in the source?
>
> The devil is in the detail as one IR can produce multiple byte code
> offsets (and byte codes) and one byte code might be created by two IR
> nodes, but it does seem to work with some tricks.
> Which I want to remove by improving the mapping and even the IR more…
> there is even the question: do we need the IR? could we not do it simpler?
>
> The IR was quite nice back when we tried to do things with byte code
> manipulation (Bytesurgeon), now it feels a bit of an overkill. But it
> simplifies e.g. the bc mapping.
>
> Marcus
>
>
agree!
IR is super when we want to manipulate the byte codes (decompiling,
instrumenting, etc...).
But if it's just for generating the CompiledMethod byte codes (compiling),
it's a bit too much and is one contributor of compilation slowdown.
Maybe we could have another polymorphic kind of IRBuilder that would be a
DirectByteCodeGenerator


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread David T. Lewis
On Fri, Jan 18, 2019 at 01:40:26PM +0100, Sven Van Caekenberghe wrote:
> Dave,
> 
> > On 18 Jan 2019, at 01:54, David T. Lewis via Pharo-dev 
> >  wrote:
> > 
> > 
> > From: "David T. Lewis" 
> > Subject: Re: [Pharo-dev] Better management of encoding of environment 
> > variables
> > Date: 18 January 2019 at 01:54:34 GMT+1
> > To: Pharo Development List 
> > 
> > 
> > On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
> >> 
> >>> On 16 Jan 2019, at 23:23, Eliot Miranda  wrote:
> >>> 
> >>> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe  
> >>> wrote:
> >>> 
> >>> The image side is perfectly capable of dealing with platform differences
> >>> in a clean/clear way, and at least we can then use the full power of our
> >>> language and our tools.
> >>> 
> >> Agreed.  At the same time I think it is very important that we don't rely
> >> on the FFI for environment variable access.  This is a basic cross-platform
> >> facility.  So I would like to see the environment accessed through 
> >> primitives,
> >> but have the image place interpretation on the result of the primitive(s),
> >> and have the primitive(s) answer a raw result, just a sequence of 
> >> uninterpreted
> >> bytes.
> >> 
> >> OK, I can understand that ENV VAR access is more fundamental than FFI
> >> (although FFI is already essential for Pharo, also during startup).
> >> 
> >>> VisualWorks takes this approach and provides a class UninterpretedBytes
> >>> that the VM is aware of.  That's always seemed like an ugly name and
> >>> overkill to me.  I would just use ByteArray and provide image level
> >>> conversion from ByteArray to String, which is what I believe we have 
> >>> anyway.
> >> 
> >> Right, bytes are always uninterpreted, else they would be something else.
> >> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our 
> >> ByteArray
> >> inspector decodes automatically if it can.
> >> 
> > 
> > Hi Sven,
> > 
> > I am the author of the getenv primitives, and I am also sadly uninformed
> > about matters of character sets and strings in a multilingual environment.
> > 
> > The primitives answer environment variable values as ByteString
> > rather than ByteArray. This made sense to me at the time that I wrote it,
> > because ByteString is easy to display in an inspector, and because it is
> > easily converted to ByteArray.
> > 
> > For an American English speaker this seems like a good choice, but I
> > wonder now if it is a bad decision. After all, it is also trivially easy
> > to convert a ByteArray to ByteString for display in the image.
> > 
> > Would it be helpful to have getenv primitives that answer ByteArray
> > instead, and to let all conversion (including in OSProcess) be done in
> > the image?
> > 
> > Thanks,
> > Dave
> 
> Normally, the correct way to represent uninterpreted bytes is with a 
> ByteArray. Decoding these bytes as characters is the specific task of a 
> character encoder/decoder, with a deliberate choice as to which to use.
> 
> Since the getenv() system call uses simple C strings, it is understandable 
> that this was carried over. It is probably not worth or too risky to change 
> that - as long as the receiver understands that it is a raw OS string that 
> needs more work.
> 
> Like with file path encoding/decoding, environment variable encoding/decoding 
> is plain messy and complex. IMHO it is better to manage that at the image 
> level where we are more agile and can better handle that complexity.
> 

Thanks Sven, that makes perfect sense to me.

>
> BTW: using funny Unicode chars, like the balloon emoji (U+1F388) 
> [https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something 
> even English speakers do.
>

You are right. I wrote those getenv primitives 20 years ago and
back then we were still doing our emoticons like this:

;-)

Thanks,
Dave
 



[Pharo-dev] [ANN] New stable VM version.

2019-01-18 Thread Esteban Lorenzano
Hi, 

Yes, I needed to promote a new stable version because last week’s was having a 
bug (which is not fixed).
This one is based on the build: 

201901172323-5a38b34

Enjoy (and please report problems)

Esteban


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Ben Coman
On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe  wrote:

>
>
> > On 18 Jan 2019, at 14:23, Guillermo Polito 
> wrote:
> >
> >
> > I think that will just overcomplicate things. Right now, all Strings in
> Pharo are unicode strings.


Cool. I didn't realise that.  But to be pedantic, which unicode encoding?
Should I presume from Sven's "UTF-8 encoding step" comment below
and the WideString class comment  "This class represents the array of 32
bit wide characters"
that the WideString encoding is UTF-32?  So should its comment be updated
to advise that?

cheers -ben

Characters are represented with their corresponding unicode codepoint.
> > If all characters in a string have codepoints < 256 then they are just
> stored in a bytestring. Otherwise they are WideStrings.
> >
> > I think assuming a single representation for strings, and then encode
> when interacting with external apps/APIs is MUCH simpler.
>
> Absolutely !
>
> (and yes I know that for outgoing FFI calls that might mean a UTF-8
> encoding step, so be it).
>


Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread Marcus Denker



> On 18 Jan 2019, at 14:26, ducasse  wrote:
> 
> I simply love the dynamic rewriting this is just too cool. We should 
> systematically use it. 
> I will continue to use it in any deprecation. 
> 

On my TODO is to make it stand-alone and provide it as a “compatibility 
transform”, too.

So we can add it to methods that we want to keep for compatibility, but they 
will nevertheless transform the code automatically.
(this then might be disabled in production to not transform)
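
As a sketch of what such a compatibility-preserving method could look like (the method below is made up; #deprecated:transformWith: as available in recent Pharo images):

    SomeClass >> oldSelector: anObject
        self
            deprecated: 'Use #newSelector: instead'
            transformWith: '`@rcv oldSelector: `@arg' -> '`@rcv newSelector: `@arg'.
        ^ self newSelector: anObject

Senders are then rewritten to the new form the first time they run the deprecated method, exactly as described above.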

> Now I have a simple question (You can explain it to me over lunch one of 
> these days).
> 
> I do not get why RBAST would not be a good representation for the compiler?
> I would like to know what is the difference.
> 
I think it is a good one. I have not yet seen a reason why not. But remember, 
Roel left Squeak because his visitor pattern for the compiler was rejected as a 
dumb idea… so there are definitely different views on core questions.

E.g. the RB AST is annotated and the whole thing for sure uses a bit more 
memory than a compiler designed for a machine from 1978.

> You mean that before going from BC to AST was difficult?

You need to do the mapping somehow, the compiler needs to remember the BC 
offset in the code generation phase and the AST (somehow) needs to store that 
information (either in every node or some table).

> How does Opal perform it? It does not use the source of the method to recreate 
> the AST, but can it do it from the BC?
> 

It uses the IR (which I still am not 100% sure about, it came from the old 
“ClosureCompiler” Design and it turned out to be quite useful, for example for 
the mapping: every IR node retains the offset of the BC it creates, then the IR 
Nodes
retain the AST node that created them. 

-> so we just do a query: “IRMethod, give me the IRInstruction that created BC 
offset X”, then “IR, which AST node created you?”, then the AST Node: “what is 
your highlight interval in the source?”

The devil is in the detail as one IR can produce multiple byte code offsets 
(and byte codes) and one byte code might be created by two IR nodes, but it 
does seem to work with some tricks. 
Which I want to remove by improving the mapping and even the IR more… there is 
even the question: do we need the IR? could we not do it simpler? 

The IR was quite nice back when we tried to do things with byte code 
manipulation (Bytesurgeon), now it feels a bit of an overkill. But it 
simplifies e.g. the bc mapping.
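
Seen from a client, the whole chain is hidden behind the method map / AST; a rough sketch (selector names such as #sourceNodeForPC: and #sourceInterval are assumed from current images and may differ):

    | method node |
    method := OrderedCollection >> #add:.
    node := method sourceNodeForPC: method endPC.   "byte code offset -> AST node"
    node sourceInterval                              "AST node -> text range to highlight"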

Marcus




Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Sven Van Caekenberghe



> On 18 Jan 2019, at 14:23, Guillermo Polito  wrote:
> 
> 
> I think that will just overcomplicate things. Right now, all Strings in Pharo 
> are unicode strings. Characters are represented with their corresponding 
> unicode codepoint.
> If all characters in a string have codepoints < 256 then they are just stored 
> in a bytestring. Otherwise they are WideStrings.
> 
> I think assuming a single representation for strings, and then encode when 
> interacting with external apps/APIs is MUCH simpler.

Absolutely !

(and yes I know that for outgoing FFI calls that might mean a UTF-8 encoding 
step, so be it).


Re: [Pharo-dev] [Vm-dev] Better management of encoding of environment variables

2019-01-18 Thread Nicolas Cellier
Le ven. 18 janv. 2019 à 14:35, ducasse  a écrit :

>
> What's important is to create abstract layers that insulate the un-needed
> complexity in the lowest layers possible.
> The VM excels at insulating of course.
> At image side we have to assume the responsibility of not leaking too much
> by ourselves.
>
> As Eliot said, right now the VM (and FFI) just take sequences of
> uninterpreted bytes (ByteArray) and pass them to API.
> The conversion ByteString/WideString <-> specifically-encoded ByteArray is
> performed at image side.
>
> With FFI, we could eventually make this conversion platform specific
> instead of always UTF8.
> The purpose would be to reduce back and forth conversions in chained API
> calls for example.
> For sanity, it is better to follow these rules:
> - the image does not attempt direct interaction with these opaque data
> (other than thru OS API)
> - nor preserve them across snapshots.
> Beware, conversion is not platform specific, but can be library specific
> (some library on windows will take UTF8).
> So we may reify the library and always double dispatch to the library, or
> we create upper level abstract messages that may chain several low level OS
> API calls.
> We would thus let complexity creep one more level, but only if we have
> good reason to do so.
> We don't want to trade uniformity for small gains.
> BTW, note that the xxxW API is already a huge uniformisation progress
> compared to the code-page specific xxxA API!
>
>
> Hi nicolas
>
> I’m reading and trying to understand, but the xxx lost me. :)
>
>
Sorry, I was talking about the Windows API variants: W for Wide characters, A
for ASCII (or rather the current code page in effect).

>
>
> Another strategy is to create more complex abstractions (i.e.
> parameterized) that can deal with a zoo of different underlying conventions.
> For example, this would be the EncodedString of VW.
> This strategy could be tempting, because it enables dealing with lower
> level platform-specific-encoded objects and still interact with them in the
> image transparently.
> But I strongly advise to think twice (or more) before introducing such
> complexity:
> - it breaks former invariants (thus potentially lot of code)
> - complexity tends to spread in many places
> I don't recommend it.
>
> PS: oups, sorry for out of band message, I wanted to send, but it seems
> that I did not press the button properly...
>
>>
>>> > On 16 Jan 2019, at 10:59, Guillermo Polito 
>>> wrote:
>>> >
>>> > Hi Nicolas,
>>> >
>>> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <
>>> nicolas.cellier.aka.n...@gmail.com> wrote:
>>> > IMO, windows VM (and plugins) should do the UCS2 -> UTF8 conversion
>>> because the purpose of a VM is to provide an OS independent façade.
>>> > I made progress recently in this area, but we should finish the
>>> job/test/consolidate.
>>> >
>>> > I'm following your changes for windows from the shadows and I think
>>> they are awesome :).
>>> >
>>> > If someone bypass the VM and use direct windows API thru FFI, then he
>>> takes the responsibility, but uniformity doesn't hurt.
>>> >
>>> >  So far we are using FFI for this, as you say we create first
>>> Win32WideStrings from utf8 strings and then we use ffi calls to the *W
>>> functions.
>>> > I don't think we can make it for Pharo7.0.0. The cycle to build, do
>>> some acceptance tests, and then bless a new VM as stable is far too long
>>> for our imminent release :).
>>> >
>>> > But this could be for a 7.1.0, and if you like I can surely give a
>>> hand on this.
>>> >
>>> > Guille
>>>
>>>
>>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>
>


Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---
I simply love the dynamic rewriting this is just too cool. We should 
systematically use it. 
I will continue to use it in any deprecation. 

Now I have a simple question (You can explain it to me over lunch one of these 
days).

I do not get why RBAST would not be a good representation for the compiler?
I would like to know what is the difference.

You mean that before going from BC to AST was difficult?
How does Opal perform it? It does not use the source of the method to recreate the 
AST, but can it do it from the BC?

Stef


>> 
> 
> But I like the “high level”: using a shared AST between the compiler and the 
> tools *and* having a mapping BC -> AST -> Text.
> 
> Of course I understand that the choice to use the RB AST for the compiler is 
> not a “traditional” one.. but it turned out to work very well *and* it brings 
> some amazing power, as we have now not only a mapping bc->text offset, but a 
> mapping bc->AST, too. This e.g. (just a a simple example) makes the magic of 
> the runtime transforming deprecations possible. See #transform on class 
> Deprecation, the #sourceNodeExecuted:
> 
> transform
>   | node rewriteRule aMethod |
>   self shouldTransform ifFalse: [ ^ self ].
>   self rewriterClass ifNil:[ ^ self signal ].
>   aMethod := self contextOfSender method.
>   aMethod isDoIt ifTrue:[^ self]. "no need to transform doits"
>   node := self contextOfSender sourceNodeExecuted.
>   rewriteRule := self rewriterClass new 
>   replace: rule key with: rule value.
>   (rewriteRule executeTree: node)
>   ifFalse: [ ^ self ].
>   node replaceWith: rewriteRule tree. 
>   Author 
>   useAuthor: 'AutoDeprecationRefactoring'
>   during: [aMethod origin compile: aMethod ast formattedCode 
> classified: aMethod protocol].   
> 
> 
>   Marcus

--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Guillermo Polito
On Fri, Jan 18, 2019 at 1:48 PM Ben Coman via Pharo-dev <
pharo-dev@lists.pharo.org> wrote:

>
>
>
>
> On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe  wrote:
>
>> Still, one of the conclusions of previous discussions about the encoding
>> of environment variables was/is that there is no single correct solution.
>> OS's are not consistent in how the encoding is done in all (historical)
>> contexts (like sometimes,
>
>
>
>> 1 env var defines the encoding to use for others,
>
>
> ouch.  That one point nearly made me retract my comment in the next paragraph,
> but is there much more complexity?
> or just a case of  utf8<==>appSpecificEncoding  rather than
> ascii<==>appSpecificEncoding ?
>

It's not much more complex. The problem is that usually the bugs that
arise from wrongly managing such conversions can be super obscure.


> Sorry if I'm rehashing past discussion (do you have a link?), but
> considering...
> * 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is*
> the standard for text
> * Strings so pervasive in a system
> ...would there be an overall benefit to adopt UTF8 as the encoding for
> Strings
> consistently provided across the cross-platform vm interface?
> (i.e. fixing platforms that don't comply to the standard due to their
> historical baggage)
>
> And I found it interesting Microsoft are making some moves towards UTF8
> [2]...
> "With insider build 17035 and the April 2018 update (nominal build 17134)
> for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support"
> checkbox appeared for setting the locale code page to UTF-8.[a] This allows
> for calling "narrow" functions, including fopen and SetWindowTextA, with
> UTF-8 strings. "
>
> The approach vm-side could be similar to Section 10 How to do text on
> Windows [3]
> with the philosophy of "performing the [conversions] as close to API calls
> as possible,
> and never holding the [converted] data."
>
> [1]
> https://w3techs.com/technologies/history_overview/character_encoding/ms/y
> [2] https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
> [3] http://utf8everywhere.org/
>
>
> different applications do different things, and other such nice stuff),
>> and certainly not across platforms.
>>
>> So this is really complex.
>>
>> Do we want to hide this in some obscure VM C code that very few people
>> can see, read, let alone help with ?
>>
>> The image side is perfectly capable of dealing with platform differences
>> in a clean/clear way, and at least we can then use the full power of our
>> language and our tools.
>>
>
> Big question... Do we currently have primitives of the same name returning
> different encodings on different platforms?  I presume that would be
> awkward.
> If the image is to handle encoding differences, should separate primitives be
> used? e.g. utf8GetEnv & utf16getEnv
>
> Could I get some feedback on [4] saying... **The Single Most Important
> Fact About Encodings**
> If you completely forget everything I just explained, please remember one
> extremely important fact.
> It does not make sense to have a string without knowing what encoding it
> uses. "
>
> And so... does our String nowadays require an 'encoding' instance variable
> such that this is *always* associated?
> This might remove any need for separate utf8GetEnv & utf16getEnv (if that
> was even a reasonable idea).
>

I think that will just overcomplicate things. Right now, all Strings in
Pharo are unicode strings. Characters are represented with their
corresponding unicode codepoint.
If all characters in a string have codepoints < 256 then they are just
stored in a bytestring. Otherwise they are WideStrings.

I think assuming a single representation for strings, and then encode when
interacting with external apps/APIs is MUCH simpler.
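
A small illustration of "encode only at the boundary" (arbitrary example):

    '€10' size.          "3 - inside the image it is just three characters"
    '€10' utf8Encoded.   "#[226 130 172 49 48] - five bytes, produced only when leaving the image"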


> cheers -ben
>
> [4]
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
>
>
>
>> > On 16 Jan 2019, at 10:59, Guillermo Polito 
>> wrote:
>> >
>> > Hi Nicolas,
>> >
>> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <
>> nicolas.cellier.aka.n...@gmail.com> wrote:
>> > IMO, windows VM (and plugins) should do the UCS2 -> UTF8 conversion
>> because the purpose of a VM is to provide an OS independent façade.
>> > I made progress recently in this area, but we should finish the
>> job/test/consolidate.
>> >
>> > I'm following your changes for windows from the shadows and I think
>> they are awesome :).
>> >
>> > If someone bypass the VM and use direct windows API thru FFI, then he
>> takes the responsibility, but uniformity doesn't hurt.
>> >
>> >  So far we are using FFI for this, as you say we create first
>> Win32WideStrings from utf8 strings and then we use ffi calls to the *W
>> functions.
>> > I don't think we can make it for Pharo7.0.0. The cycle to build, do
>> some acceptance tests, and then bless a new VM as stable is far too long
>> for our imminent release :).
>> >
>> 

Re: [Pharo-dev] DebugSession>>activePC:

2019-01-18 Thread Marcus Denker via Pharo-dev
--- Begin Message ---


> On 11 Jan 2019, at 20:28, Eliot Miranda  wrote:
> 
> Hi Thomas,
> 
>  forgive me, my first response was too terse.  Having thought about it in the 
> shower it becomes clear :-)
> 
>> On Jan 11, 2019, at 6:49 AM, Thomas Dupriez  
>> wrote:
>> 
>> Hi,
>> 
>> Yes, my question was just of the form: "Hey there's this method in 
>> DebugSession. What is it doing? What's the intention behind it? Does someone 
>> know?". There was no hidden agenda behind it.
>> 
>> @Eliot
>> 
>> After taking another look at this method, there's something I don't 
>> understand:
>> 
>> activePC: aContext
>> ^ (self isLatestContext: aContext)
>>ifTrue: [ interruptedContext pc ]
>>ifFalse: [ self previousPC: aContext ]
>> 
>> isLatestContext: checks whether its argument is the suspended context (the 
>> context at the top of the stack of the interrupted process). And if that's 
>> true, activePC: returns the pc of **interruptedContext**, not of the 
>> suspended context. These two contexts are different when the debugger opens 
>> on an exception, so this method is potentially returning a pc for another 
>> context than its argument...
>> 
>> Another question I have to improve the comment for this method is: what's 
>> the high-level meaning of this concept of "activePC". You gave the formal 
>> definition, but what's the point of defining this so to speak? What makes 
>> this concept interesting enough to warrant defining it and giving it a name?
> 
> There are two “modes” where a pc is mapped to a source range.  One is when 
> stepping a context in the debugger (the context is on top and is actively 
> executing bytecodes).  Here the debugger stops immediately before a send or 
> assignment or return, so that for sends we can do into or over, or for 
> assignments or returns check stack top to see what will be assigned or 
> returned.  In this mode we want the pc of the send, assign or return to map 
> to the source range for the send, or the expression being assigned or 
> returned.  Since this is the “common case”, and since this is the only choice 
> that makes sense for assignments ta and returns, the bytecode compiler 
> constructs it’s pc to source range map in terms of the pc of the first byte 
> if the send, assign or return bytecode.
> 
> The second “mode” is when selecting a context below the top context.  The pc 
> for any context below the top context will be the return pc for a send, 
> because the send has already happened.  The compiler could choose to map this 
> pc to the send, but it would not match what works for the common case. 
> Another choice would appear to be to have two map entries, one for the send and 
> one for the return pc, both mapping to the source range.  But this wouldn’t 
> work because the result of a send might be assigned or returned and so there 
> is a potential conflict.  Instead the reasonable solution is to select the 
> previous pc for contexts below the top context, which will be the pc for 
> the start of the send bytecode.
> 


I checked with Thomas

-> for source mapping, we use the API of the method map. The map does the “get 
the mapping for the instruction before”; it just needs to be told whether we ask 
for the range of an active context:

#rangeForPC:contextIsActiveContext:

it is called

	^ aContext debuggerMap
		rangeForPC: aContext pc
		contextIsActiveContext: (self isLatestContext: aContext) ]

So the logic was moved from the debugger to the Map. (I think this is even your 
design?), and thus the logic inside the debugger is not needed anymore. 

-> For the question why the AST node of the Block has no simple method to check 
if it is an “isOptimized” block (e.g. in an ifTrue:): Yes, that might be nice. 
The reason why it is not there is that the compiler internally uses 
OCOptimizedBlockScope and a check on the AST level was never needed. I would 
have added it as soon as it would be needed (quite easy to do). 

On the AST level there is already a check, though, if a *send* is optimized, 
which is used for code generation (#isInlined).

The question why we did not add more compatibility layers to the old AST is 
that it is quite different... so even with some methods here and there 99% of 
clients need changes, so I did not even investigate it too much. With the idea 
that if we end up in a situation where we need just a method, we can just add 
it as needed.

In general, I have no strong feelings about the details of the implementation 
of Opal. It’s just what was possible, with the resources and the understanding 
that I had at the point it was done. There are for sure even mistakes. I always 
make mistakes.
In addition, it is based on a prior compiler which was done for a different 
closure model whose design choices influenced it too much I think, sometimes 
good, sometimes not.

But I like the “high level”: using a shared AST between the compiler and the 
tools *and* having a mapping BC -> AST -> Text.


Re: [Pharo-dev] [Vm-dev] Better management of encoding of environment variables

2019-01-18 Thread Nicolas Cellier
Le mer. 16 janv. 2019 à 23:23, Eliot Miranda  a
écrit :

>
> Hi Sven,
>
> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe 
> wrote:
>
>> Still, one of the conclusions of previous discussions about the encoding
>> of environment variables was/is that there is no single correct solution.
>> OS's are not consistent in how the encoding is done in all (historical)
>> contexts (like sometimes, 1 env var defines the encoding to use for others,
>> different applications do different things, and other such nice stuff), and
>> certainly not across platforms.
>>
>> So this is really complex.
>>
>> Do we want to hide this in some obscure VM C code that very few people
>> can see, read, let alone help with ?
>>
>> The image side is perfectly capable of dealing with platform differences
>> in a clean/clear way, and at least we can then use the full power of our
>> language and our tools.
>>
>
> Agreed.  At the same time I think it is very important that we don't rely
> on the FFI for environment variable access.  This is a basic cross-platform
> facility.  So I would like to see the environment accessed through
> primitives, but have the image place interpretation on the result of the
> primitive(s), and have the primitive(s) answer a raw result, just a
> sequence of uninterpreted bytes.
>
> VisualWorks takes this approach and provides a class UninterpretedBytes
> that the VM is aware of.  That's always seemed like an ugly name and
> overkill to me.  I would just use ByteArray and provide image level
> conversion from ByteArray to String, which is what I believe we have anyway.
>
>
What's important is to create abstract layers that insulate the un-needed
complexity in the lowest layers possible.
The VM excels at insulating of course.
At image side we have to assume the responsibility of not leaking too much
by ourselves.

As Eliot said, right now the VM (and FFI) just take sequences of
uninterpreted bytes (ByteArray) and pass them to API.
The conversion ByteString/WideString <-> specifically-encoded ByteArray is
performed at image side.
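
Today that conversion is visible right at the FFI boundary; a minimal sketch in the spirit of the UnifiedFFI getenv example (class name, library name and the encoding policy are illustrative assumptions, not an existing Pharo method; library name shown for Linux):

    MyLibCFacade >> getenv: aName
        "The String argument and return value cross the image/OS boundary here;
         this is the place where an encoding policy has to be applied."
        ^ self ffiCall: #( String getenv (String aName) )

    MyLibCFacade >> ffiLibraryName
        "Resolve calls against the C library (name is platform dependent)."
        ^ 'libc.so.6'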

With FFI, we could eventually make this conversion platform specific
instead of always UTF8.
The purpose would be to reduce back and forth conversions in chained API
calls for example.
For sanity, it is better to follow these rules:
- the image does not attempt direct interaction with these opaque data
(other than thru OS API)
- nor preserve them across snapshots.
Beware, conversion is not platform specific, but can be library specific
(some library on windows will take UTF8).
So we may reify the library and always double dispatch to the library, or
we create upper level abstract messages that may chain several low level OS
API calls.
We would thus let complexity creep one more level, but only if we have good
reason to do so.
We don't want to trade uniformity for small gains.
BTW, note that the xxxW API is already a huge uniformisation progress
compared to the code-page specific xxxA API!

Another strategy is to create more complex abstractions (i.e.
parameterized) that can deal with a zoo of different underlying conventions.
For example, this would be the EncodedString of VW.
This strategy could be tempting, because it enables dealing with lower
level platform-specific-encoded objects and still interact with them in the
image transparently.
But I strongly advise to think twice (or more) before introducing such
complexity:
- it breaks former invariants (thus potentially lot of code)
- complexity tends to spread in many places
I don't recommend it.

PS: oups, sorry for out of band message, I wanted to send, but it seems
that I did not press the button properly...

>
>> > On 16 Jan 2019, at 10:59, Guillermo Polito 
>> wrote:
>> >
>> > Hi Nicolas,
>> >
>> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <
>> nicolas.cellier.aka.n...@gmail.com> wrote:
>> > IMO, windows VM (and plugins) should do the UCS2 -> UTF8 conversion
>> because the purpose of a VM is to provide an OS independent façade.
>> > I made progress recently in this area, but we should finish the
>> job/test/consolidate.
>> >
>> > I'm following your changes for windows from the shadows and I think
>> they are awesome :).
>> >
>> > If someone bypass the VM and use direct windows API thru FFI, then he
>> takes the responsibility, but uniformity doesn't hurt.
>> >
>> >  So far we are using FFI for this, as you say we create first
>> Win32WideStrings from utf8 strings and then we use ffi calls to the *W
>> functions.
>> > I don't think we can make it for Pharo7.0.0. The cycle to build, do
>> some acceptance tests, and then bless a new VM as stable is far too long
>> for our imminent release :).
>> >
>> > But this could be for a 7.1.0, and if you like I can surely give a hand
>> on this.
>> >
>> > Guille
>>
>>
>>
>
> --
> _,,,^..^,,,_
> best, Eliot
>


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Ben Coman via Pharo-dev
--- Begin Message ---
On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe  wrote:

> Still, one of the conclusions of previous discussions about the encoding
> of environment variables was/is that there is no single correct solution.
> OS's are not consistent in how the encoding is done in all (historical)
> contexts (like sometimes,



> 1 env var defines the encoding to use for others,


ouch.  That one point nearly made me retract my comment in the next paragraph, but
is there much more complexity?
or just a case of  utf8<==>appSpecificEncoding  rather than
ascii<==>appSpecificEncoding ?

Sorry if I'm rehashing past discussion (do you have a link?), but
considering...
* 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is*
the standard for text
* Strings so pervasive in a system
...would there be an overall benefit to adopt UTF8 as the encoding for
Strings
consistently provided across the cross-platform vm interface?
(i.e. fixing platforms that don't comply to the standard due to their
historical baggage)

And I found it interesting Microsoft are making some moves towards UTF8
[2]...
"With insider build 17035 and the April 2018 update (nominal build 17134)
for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support"
checkbox appeared for setting the locale code page to UTF-8.[a] This allows
for calling "narrow" functions, including fopen and SetWindowTextA, with
UTF-8 strings. "

The approach vm-side could be similar to Section 10 How to do text on
Windows [3]
with the philosophy of "performing the [conversions] as close to API calls
as possible,
and never holding the [converted] data."

[1]
https://w3techs.com/technologies/history_overview/character_encoding/ms/y
[2] https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
[3] http://utf8everywhere.org/


different applications do different things, and other such nice stuff), and
> certainly not across platforms.
>
> So this is really complex.
>
> Do we want to hide this in some obscure VM C code that very few people can
> see, read, let alone help with ?
>
> The image side is perfectly capable of dealing with platform differences
> in a clean/clear way, and at least we can then use the full power of our
> language and our tools.
>

Big question... Do we currently have primitives of the same name returning
different encodings on different platforms?  I presume that would be
awkward.
If the image is to handle encoding differences, should separate primitives be
used? e.g. utf8GetEnv & utf16getEnv

Could I get some feedback on [4] saying... **The Single Most Important Fact
About Encodings**
If you completely forget everything I just explained, please remember one
extremely important fact.
It does not make sense to have a string without knowing what encoding it
uses. "

And so... does our String nowadays require an 'encoding' instance variable
such that this is *always* associated?
This might remove any need for separate utf8GetEnv & utf16getEnv (if that
was even a reasonable idea).

cheers -ben

[4]
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/



> > On 16 Jan 2019, at 10:59, Guillermo Polito 
> wrote:
> >
> > Hi Nicolas,
> >
> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <
> nicolas.cellier.aka.n...@gmail.com> wrote:
> > IMO, windows VM (and plugins) should do the UCS2 -> UTF8 conversion
> because the purpose of a VM is to provide an OS independent façade.
> > I made progress recently in this area, but we should finish the
> job/test/consolidate.
> >
> > I'm following your changes for windows from the shadows and I think they
> are awesome :).
> >
> > If someone bypass the VM and use direct windows API thru FFI, then he
> takes the responsibility, but uniformity doesn't hurt.
> >
> >  So far we are using FFI for this, as you say we create first
> Win32WideStrings from utf8 strings and then we use ffi calls to the *W
> functions.
> > I don't think we can make it for Pharo7.0.0. The cycle to build, do some
> acceptance tests, and then bless a new VM as stable is far too long for our
> imminent release :).
> >
> > But this could be for a 7.1.0, and if you like I can surely give a hand
> on this.
> >
> > Guille
>
>
>
--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Sven Van Caekenberghe
Dave,

> On 18 Jan 2019, at 01:54, David T. Lewis via Pharo-dev 
>  wrote:
> 
> 
> From: "David T. Lewis" 
> Subject: Re: [Pharo-dev] Better management of encoding of environment 
> variables
> Date: 18 January 2019 at 01:54:34 GMT+1
> To: Pharo Development List 
> 
> 
> On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
>> 
>>> On 16 Jan 2019, at 23:23, Eliot Miranda  wrote:
>>> 
>>> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe  wrote:
>>> 
>>> The image side is perfectly capable of dealing with platform differences
>>> in a clean/clear way, and at least we can then use the full power of our
>>> language and our tools.
>>> 
>> Agreed.  At the same time I think it is very important that we don't rely
>> on the FFI for environment variable access.  This is a basic cross-platform
>> facility.  So I would like to see the environment accessed through 
>> primitives,
>> but have the image place interpretation on the result of the primitive(s),
>> and have the primitive(s) answer a raw result, just a sequence of 
>> uninterpreted
>> bytes.
>> 
>> OK, I can understand that ENV VAR access is more fundamental than FFI
>> (although FFI is already essential for Pharo, also during startup).
>> 
>>> VisualWorks takes this approach and provides a class UninterpretedBytes
>>> that the VM is aware of.  That's always seemed like an ugly name and
>>> overkill to me.  I would just use ByteArray and provide image level
>>> conversion from ByteArray to String, which is what I believe we have anyway.
>> 
>> Right, bytes are always uninterpreted, else they would be something else.
>> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
>> inspector decodes automatically if it can.
>> 
> 
> Hi Sven,
> 
> I am the author of the getenv primitives, and I am also sadly uninformed
> about matters of character sets and strings in a multilingual environment.
> 
> The primitives answer environment variable values as ByteString
> rather than ByteArray. This made sense to me at the time that I wrote it,
> because ByteString is easy to display in an inspector, and because it is
> easily converted to ByteArray.
> 
> For an American English speaker this seems like a good choice, but I
> wonder now if it is a bad decision. After all, it is also trivially easy
> to convert a ByteArray to ByteString for display in the image.
> 
> Would it be helpful to have getenv primitives that answer ByteArray
> instead, and to let all conversion (including in OSProcess) be done in
> the image?
> 
> Thanks,
> Dave

Normally, the correct way to represent uninterpreted bytes is with a ByteArray. 
Decoding these bytes as characters is the specific task of a character 
encoder/decoder, with a deliberate choice as to which to use.
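
For example, if a raw getenv primitive answered uninterpreted bytes, the image-side decoding would become an explicit, deliberate step (a sketch; the variable name and the byte values are made up):

    | rawValue |
    rawValue := #[47 104 111 109 101 47 115 118 101 110].  "as if answered by a raw primitive"
    ZnUTF8Encoder new decodeBytes: rawValue.               "'/home/sven'"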

Since the getenv() system call uses simple C strings, it is understandable that 
this was carried over. It is probably not worth it, or too risky, to change that - 
as long as the receiver understands that it is a raw OS string that needs more 
work.

Like with file path encoding/decoding, environment variable encoding/decoding 
is plain messy and complex. IMHO it is better to manage that at the image level 
where we are more agile and can better handle that complexity.

Sven

BTW: using funny Unicode chars, like the balloon emoji (U+1F388) 
[https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something 
even English speakers do.





[Pharo-dev] [Pharo 7.0] Build #130: 22896-Creating-methods-in-a-subclass-with-a-class-using-a-trait

2019-01-18 Thread ci-pharo-ci-jenkins2--- via Pharo-dev
--- Begin Message ---
There is a new Pharo build available!
  
The status of the build #130 was: FAILURE.

The Pull Request #2266 was integrated: 
"22896-Creating-methods-in-a-subclass-with-a-class-using-a-trait"
Pull request url: https://github.com/pharo-project/pharo/pull/2266

Issue Url: https://pharo.fogbugz.com/f/cases/22896
Build Url: 
https://ci.inria.fr/pharo-ci-jenkins2/job/Test%20pending%20pull%20request%20and%20branch%20Pipeline/job/Pharo7.0/130/
--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---
> 
> So making the primitives return ByteArray instances instead of ByteString 
> should be safe enough :).
> But this is in my opinion clearly a hack instead of fixing the real problem, 
> and we have to be careful to guard such patterns with comments everywhere 
> explaining why the bytearray conversion is really needed there…

Guillermo, what is the correct way to do it?

> Would it be helpful to have getenv primitives that answer ByteArray
> instead, and to let all conversion (including in OSProcess) be done in
> the image?
> 
> Well, personally I would like that getenv/setenv and getcwd setcwd support 
> are not in a plugin but as a basic service provided by the vm.
> 
> Cheers,
> Guille
>  
> 
> Thanks,
> Dave
> 
> 
> 
> 
> -- 
>
> Guille Polito
> Research Engineer
> 
> Centre de Recherche en Informatique, Signal et Automatique de Lille
> CRIStAL - UMR 9189
> French National Center for Scientific Research - http://www.cnrs.fr 
> 
> 
> Web: http://guillep.github.io 
> Phone: +33 06 52 70 66 13

--- End Message ---


Re: [Pharo-dev] Purpose of VM [was: Re: Better management of encoding of environment variables]

2019-01-18 Thread ducasse via Pharo-dev
--- Begin Message ---

> > 
> > And if it's in the image you get to do the programming in Smalltalk rather 
> > than C or Slang, which is more fun for most of us. And, let's face it, fun 
> > is an important metric in an open-source project -- things that are fun are 
> > much more likely to get done.
> 
> +100
> 
> The VM *is* developed in Smalltalk
> https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
>  
> 
That is not the point of Martin's message. I imagine that Martin and Sven 
understand perfectly well that the VM is written in Slang and that there
is a simulator. Still, many of us agree with their analysis: the VM should focus 
on execution and delegate most of the rest to the image. 

Stef
--- End Message ---


Re: [Pharo-dev] Better management of encoding of environment variables

2019-01-18 Thread Guillermo Polito
On Fri, Jan 18, 2019 at 1:58 AM David T. Lewis via Pharo-dev <
pharo-dev@lists.pharo.org> wrote:

> On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
> >
> > > On 16 Jan 2019, at 23:23, Eliot Miranda 
> wrote:
> > >
> > > On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe 
> wrote:
> > >
> > > The image side is perfectly capable of dealing with platform
> differences
> > > in a clean/clear way, and at least we can then use the full power of
> our
> > > language and our tools.
> > >
> > Agreed.


+1

> > At the same time I think it is very important that we don't rely
> > on the FFI for environment variable access.  This is a basic
> cross-platform
> > facility.  So I would like to see the environment accessed through
> primitives,
> > but have the image place interpretation on the result of the
> primitive(s),
> > and have the primitive(s) answer a raw result, just a sequence of
> uninterpreted
> >  bytes.
>

Having looked at it not so long ago, I'll add my 2cts.

Environment access is a very particular scenario.
We have in Pharo many startup actions that directly or indirectly
(FileLocator home?) require environment variable access, and thus we have
to be really careful and picky to make sure that they all work,
dependencies are installed in the right order and so on...

In Pharo 6 this was especially difficult because FFI was dynamically
compiling methods,
  => which required access to argument names,
    => which required access to the source files,
      => which required access to the env vars (because in Pharo the
         source/changes files are looked up in other directories than the
         image/vm ones),
        => which loops :)

In Pharo 7, argument names in FFI calls are embedded in the method metadata,
so all that is avoided.

Still I'd agree that moving this support to a primitive would make it less
fragile.
I'd apply the same to getting/setting the working directory.

>
> > OK, I can understand that ENV VAR access is more fundamental than FFI
> > (although FFI is already essential for Pharo, also during startup).
> >
> > > VisualWorks takes this approach and provides a class UninterpretedBytes
> > > that the VM is aware of.  That's always seemed like an ugly name and
> > > overkill to me.  I would just use ByteArray and provide image level
> > > conversion from ByteArray to String, which is what I believe we have
> anyway.
> >
> > Right, bytes are always uninterpreted, else they would be something else.
> > We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our
> ByteArray
> >  inspector decodes automatically if it can.
> >
>
> Hi Sven,
>
> I am the author of the getenv primitives, and I am also sadly uninformed
> about matters of character sets and strings in a multilingual environment.
>
> The primitives answer environment variable values as ByteString
> rather than ByteArray. This made sense to me at the time that I wrote it,
> because ByteString is easy to display in an inspector, and because it is
> easily converted to ByteArray.
>
> For an American English speaker this seems like a good choice, but I
> wonder now if it is a bad decision.


Well, as soon as you want to manage some internationalisation, indeed it is.
It is also a source of bugs, because assuming ASCII is not right for
English either.
Most platforms will assume UTF-8 by default, and it's not quite the same for
many symbols :).

For example,

Character allByteCharacters size. => 256
Character allByteCharacters utf8Encoded size. => 384

Character allByteCharacters select: [ :c | c asString utf8Encoded size > 1 ].

'€ ‚ƒ„…†‡ˆ‰Š‹Œ Ž ‘’“”•–—˜™š›œ žŸ
¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

Of course, many of those characters may not be used in the day-to-day of
many people, but sooner or later we will run into one of them (I'm thinking
about the not-so-strange case of a database storing names :)).
Also think about the poor Windows users (like myself since 2 weeks ago),
who have to think about UTF-16!

BTW, I hope I'm not breaking anybody's mail client by pasting strange
characters here :D (and if so, you may want to suggest they review how they
manage encoding :))

> After all, it is also trivially easy
> to convert a ByteArray to ByteString for display in the image.
>

Yes, but it's sometimes difficult to find all such places, as there are many
primitives spread across a lot of places doing the wrong thing, which is a
source of bugs...
I'd like to fix it from the root; the question is how to do it without
breaking anything ^^.
In Pharo we are doing this in many places:

self primitiveXXX asByteArray utf8Decoded

So making the primitives return ByteArray instances instead of ByteString
should be safe enough :).
But this is, in my opinion, clearly a hack instead of a fix for the real
problem, and we have to be careful to guard such patterns with comments
everywhere explaining why the ByteArray conversion is really needed there...
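
To make that concrete, a hypothetical image-side wrapper (the primitive selector 
and its nil behaviour are assumptions, not the current API) would keep both the 
decoding decision and the explanatory comment in one place:

environmentAt: aVariableName
    "Answer the value of the environment variable named aVariableName.
    The VM hands us uninterpreted bytes; decoding is an image-side decision."
    | rawBytes |
    rawBytes := self primitiveGetEnvRaw: aVariableName utf8Encoded.  "hypothetical primitive answering a ByteArray or nil"
    ^ rawBytes ifNotNil: [ rawBytes utf8Decoded ]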



> Would it be helpful to have getenv primitives