Re: [Pharo-dev] Peek

2013-11-13 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:56, Sven Van Caekenberghe  wrote:

> 
> On 06 Nov 2013, at 13:20, Henrik Johansen  
> wrote:
> 
>> 
>> On 04 Nov 2013, at 5:12 , Sven Van Caekenberghe  wrote:
>> 
>>> 
>>> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
>>> did not yet make use of the optimisations that I did for 
>>> ZnCharacterEncoding some time ago. More specifically, they were not yet 
>>> using #next:putAll:startingAt:toStream: and 
>>> #readInto:startingAt:count:fromStream: which are overridden for 
>>> ZnUTF8Encoder with (super hacked) versions that assume most of the input 
>>> will be ASCII (a reasonable assumption).
>>> 
>>> I am still chasing a bug, but right now:
>>> 
>>> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
>>> binary))
>>> next: 65536; close ] bench. 
>>> 
>>> "135 per second.” BEFORE
>>> "3,310 per second.” AFTER
>>> 
>>> But of course the input file is ASCII, so YMMV.
>>> 
>>> I’ll let you know when I commit this code.
>>> 
>>> Sven
>> 
>> Yeah… sooo, I loaded the updated version, great improvement for streams on 
>> Latin1 content :D
>> 
>> Maybe it’s just me, but I tested with actual wide source as well (it was as 
>> slow as you’d expect), and I think you need a notQuiteSoOptimizedReadInto* 
>> which uses the normal Byte -> Wide become: conversion machinery.
>> Writing a ZnByteStringBecameWideString handler for every next: (and cousins) 
>> call where source may be non-latin1 is a real chore/nasty surprise for those 
>> used to dealing with legacy streams/converters... 
>> You can sort of kinda make up for the performance hit (at least on these 
>> sizes) with a faster replace:from:to:with:startingAt: in use after the fact, 
>> by using basicAt:put: to the string, thus avoiding converting the 
>> replacement value to character. 
>> 
>> Cheers,
>> Henry
>> 
>> PS: Why is ZnByteStringBecameWideString a notification and not a resumable 
>> exception? I would assume those who run into it without a handler would 
>> rather have an actual error, than a result where their string has been read 
>> up to the first wide char…
> 
> Henrik, thanks again for the feedback, it is really useful.
> 
> You are right, the ZnByteStringBecameWideString hack was introduced 
> specifically to avoid the become, because benchmarking showed that it took 
> too long. The original user is ZnStringEntity>>#readFrom: 
> 
> Now, if you know up front that your string will be wide anyway, then there is 
> #decodeBytesIntoWideString:
> 
> Yes, ZnByteStringBecameWideString should probably be a resumable exception, 
> as not handling it properly will lead to errors.
> 
> I will have to think and reconsider the generality/specificity of the 
> optimisation hack in ZnUTF8Encoder. I got some ideas, but it will take some 
> time.
> 
> At least the code works now, time to process the constructive criticism.

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.28
Author: SvenVanCaekenberghe
Time: 13 November 2013, 3:04:00.925461 pm
UUID: f104e1ee-7fe5-4a23-a650-197dd9462783
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.27

ZnByteStringBecameWideString is now a resumable Error instead of a Notification 
(as suggested by Henrik Sperre Johansen);
Added ZnByteStringBecameWideString>>#becomeForward convenience method
Fixed ZnCharacterReadStream>>#readInto:startingAt:count: to do the actual 
becomeForward when needed so that clients are not affected
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.15
Author: SvenVanCaekenberghe
Time: 13 November 2013, 3:05:18.26391 pm
UUID: 8c41ee29-9906-4626-a9da-cb576980cd4c
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.14

Added a unit test to make sure 
ZnCharacterReadStream>>#readInto:startingAt:count: does an actual becomeForward 
when needed
===
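
A minimal sketch of the handler pattern this enables at call sites that decode
into a ByteString themselves (the decoding expression is illustrative, and
#encodeString: is assumed from the encoder API; #becomeForward and #resume per
the commit notes above):

| encoder input buffer |
encoder := ZnUTF8Encoder new.
input := (encoder encodeString: 'Hello, 世界') readStream.
buffer := String new: 9.
[ encoder readInto: buffer startingAt: 1 count: 9 fromStream: input ]
	on: ZnByteStringBecameWideString
	do: [ :ex | ex becomeForward. ex resume ].
buffer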

At least the situation is correct now.

Sven




Re: [Pharo-dev] Peek

2013-11-08 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:59, Sven Van Caekenberghe  wrote:

> 
> On 06 Nov 2013, at 13:29, Henrik Johansen  
> wrote:
> 
>> On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe  wrote:
>> 
>>> BTW, do we still need UTF16 support ?
>>> 
>>> For those encodings that we still want to support in the future, we should 
>>> have new and more principled implementations under ZnCharacterEncoder. That 
>>> is, if we ever want to phase out TextConverter.
>> 
>> UTF16 is the encoding for string arguments to the Unicode-aware versions of 
>> the Windows APIs, so I’d say yes.
>> Not too sure about JNI, but I know Java also uses UTF16 internally.
>> 
>> So the short answer is yes.
>> Not as a storage encoding, but for calling foreign functions, it’s quite 
>> important.
> 
> Ah, OK.
> 
> Fancy giving it a try, to create a ‘new’ implementation under 
> ZnCharacterEncoder with some tests ?
> 
> There are only 3 abstract methods to implement; if you already know 
> UTF16, it would not be too much work ;-)

A first attempt:

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.27
Author: SvenVanCaekenberghe
Time: 8 November 2013, 4:18:07.642898 pm
UUID: 29824770-7b9d-4377-a934-7bb2fbeefefb
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.26

Added new ZnUTF16Encoder
Minor refactoring to ZnCharacterEncoder hierarchy
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.14
Author: SvenVanCaekenberghe
Time: 8 November 2013, 4:18:53.717919 pm
UUID: 6309f553-1632-438d-825c-b7a0f89193f4
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.13

Added unit tests for the new ZnUTF16Encoder
===
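
A quick roundtrip to try it (assuming the generic #encodeString: / #decodeBytes: 
API that ZnCharacterEncoder subclasses share):

| encoder bytes |
encoder := ZnUTF16Encoder new.
bytes := encoder encodeString: 'Les élèves Françaises'.
encoder decodeBytes: bytes
"=> 'Les élèves Françaises'"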

>> Cheers,
>> Henry




Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:29, Henrik Johansen  wrote:

> On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe  wrote:
> 
>> BTW, do we still need UTF16 support ?
>> 
>> For those encodings that we still want to support in the future, we should 
>> have new and more principled implementations under ZnCharacterEncoder. That 
>> is, if we ever want to phase out TextConverter.
> 
> UTF16 is the encoding for string arguments to the Unicode-aware versions of 
> the Windows APIs, so I’d say yes.
> Not too sure about JNI, but I know Java also uses UTF16 internally.
> 
> So the short answer is yes.
> Not as a storage encoding, but for calling foreign functions, it’s quite 
> important.

Ah, OK.

Fancy giving it a try, to create a ‘new’ implementation under 
ZnCharacterEncoder with some tests ?

There are only 3 abstract methods to implement; if you already know 
UTF16, it would not be too much work ;-)
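
For the archives, a rough sketch of the shape such a subclass takes (all names 
here are made up; it is big-endian and BMP-only, ignoring surrogate pairs and 
byte order marks, and the three abstract methods are my assumption of what 
needs overriding):

ZnCharacterEncoder subclass: #ZnUTF16BEToyEncoder
	instanceVariableNames: ''
	classVariableNames: ''
	category: 'Sandbox-Encoders'

ZnUTF16BEToyEncoder>>nextFromStream: stream
	"Decode one character from two big-endian bytes (no end-of-stream handling)."
	| high low |
	high := stream next.
	low := stream next.
	^ Character value: (high bitShift: 8) + low

ZnUTF16BEToyEncoder>>nextPut: character toStream: stream
	"Encode one character as two big-endian bytes."
	| code |
	code := character value.
	stream nextPut: (code bitShift: -8).
	stream nextPut: (code bitAnd: 16rFF)

ZnUTF16BEToyEncoder>>encodedByteCountFor: character
	"Every BMP character takes exactly two bytes."
	^ 2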


> Cheers,
> Henry
> 




Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:20, Henrik Johansen  wrote:

> 
> On 04 Nov 2013, at 5:12 , Sven Van Caekenberghe  wrote:
> 
>> 
>> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
>> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
>> some time ago. More specifically, they were not yet using 
>> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
>> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
>> assume most of the input will be ASCII (a reasonable assumption).
>> 
>> I am still chasing a bug, but right now:
>> 
>> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
>> binary))
>>  next: 65536; close ] bench. 
>> 
>>  "135 per second.” BEFORE
>>  "3,310 per second.” AFTER
>> 
>> But of course the input file is ASCII, so YMMV.
>> 
>> I’ll let you know when I commit this code.
>> 
>> Sven
> 
> Yeah… sooo, I loaded the updated version, great improvement for streams on 
> Latin1 content :D
> 
> Maybe it’s just me, but I tested with actual wide source as well (it was as 
> slow as you’d expect), and I think you need a notQuiteSoOptimizedReadInto* 
> which uses the normal Byte -> Wide become: conversion machinery.
> Writing a ZnByteStringBecameWideString handler for every next: (and cousins) 
> call where source may be non-latin1 is a real chore/nasty surprise for those 
> used to dealing with legacy streams/converters... 
> You can sort of kinda make up for the performance hit (at least on these 
> sizes) with a faster replace:from:to:with:startingAt: in use after the fact, 
> by using basicAt:put: to the string, thus avoiding converting the replacement 
> value to character. 
> 
> Cheers,
> Henry
> 
> PS: Why is ZnByteStringBecameWideString a notification and not a resumable 
> exception? I would assume those who run into it without a handler would 
> rather have an actual error, than a result where their string has been read 
> up to the first wide char…

Henrik, thanks again for the feedback, it is really useful.

You are right, the ZnByteStringBecameWideString hack was introduced 
specifically to avoid the become, because benchmarking showed that it took too 
long. The original user is ZnStringEntity>>#readFrom: 

Now, if you know up front that your string will be wide anyway, then there is 
#decodeBytesIntoWideString:
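
For example (a sketch, with #encodeString: just to produce some wide input):

ZnUTF8Encoder new decodeBytesIntoWideString:
	(ZnUTF8Encoder new encodeString: 'Hello, 世界')

which answers a WideString directly, skipping the ByteString detour and the become.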

Yes, ZnByteStringBecameWideString should probably be a resumable exception, as 
not handling it properly will lead to errors.

I will have to think and reconsider the generality/specificity of the 
optimisation hack in ZnUTF8Encoder. I got some ideas, but it will take some 
time.

At least the code works now, time to process the constructive criticism.

Regards,

Sven








Re: [Pharo-dev] Peek

2013-11-06 Thread Camillo Bruni
that would be called XStreams, no?

On 2013-11-06, at 13:43, Stéphane Ducasse  wrote:

> Hi henrik
> 
> why don't you give it a try to change our lives and propose a new 
> MultiByteFileStream and friends :)
> 
> Stef
> 
> 
>> That's great!
>> Remembering that commit message was part of the reason for benching, I was 
>> sort of disappointed there was no significant difference between Zn in 2.0 
>> and the latest 3.0...
>> 
>> I guess with the amount of hacks accumulating, it is indeed turning into a 
>> worthy successor of MultiByteFileStream ;)
>> 
>> Cheers,
>> Henry
>> 
>> P.S: If you want another delightful one in the same vein (both from 
>> WTFy-ness and perf improvement POV), take a gander at UTF16TextConverter >> 
>> nextPutByteString:toStream: 
>> 
>> 
>> On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe  wrote:
>> Hi Henrik,
>> 
>> Great writeup, thanks !
>> 
>> (more inline)
>> 
>> On 04 Nov 2013, at 11:58, Henrik Johansen  
>> wrote:
>> 
>> > On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
>> >
>> >> Working on Petit Delphi we found a strange implementation for 
>> >> asPetitStream:
>> >> Stream>>asPetitStream
>> >>  ^ self contents asPetitStream
>> >>
>> >> Further investigation showed that the basic peek was not fast enough for 
>> >> Petit Parser, as it is used a lot. So it implemented an "improved 
>> >> unchecked peek":
>> >> PPStream>>peek
>> >>  "An improved version of peek, that is slightly faster than the built 
>> >> in version."
>> >>  ^ self atEnd ifFalse: [ collection at: position + 1 ]
>> >>
>> >> PPStream>>uncheckedPeek
>> >>  "An unchecked version of peek that throws an error if we try to peek 
>> >> over the end of the stream, even faster than #peek."
>> >>  ^ collection at: position + 1
>> >>
>> >> But to my knowledge a basic peek should be fast. The real problem is the 
>> >> peek in the underlying stream:
>> >> PositionableStream>>peek
>> >>  "Answer what would be returned if the message next were sent to the
>> >>  receiver. If the receiver is at the end, answer nil."
>> >>
>> >>  | nextObject |
>> >>  self atEnd ifTrue: [^nil].
>> >>  nextObject := self next.
>> >>  position := position - 1.
>> >>  ^nextObject
>> >>
>> >> That actually uses "self next". The least thing one should do is to cache 
>> >> the next object. But isn't there a primitive for peek in a file stream? 
>> >> Because all overriding peeks of PositionableStream have basically the same 
>> >> implementation: reading the next and restoring the state to before the 
>> >> peek (that is slow). So we would like to be able to remove PPStream 
>> >> without causing performance issues, as the only added method is the 
>> >> "improved peek".
>> >>
>> >> Stephan and Diego
>> >
>> > If you are reading from file, ZnCharacterStream should be a valid 
>> > alternative.
>> > If not, ZnBufferedReadStream on an internal collection stream also does 
>> > peek caching.
>> >
>> > Beware with files though; it’s better to bench the overall operation for 
>> > different alternatives.
>> > F.ex, ZnCharacterStream is much faster than the standard FileStream for 
>> > peek:
>> >
>> > cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference 
>> > readStream binary.
>> > [cr peek] bench. '49,400,000 per second.'
>> > cr close.
>> >
>> > FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] 
>> > '535,000 per second.'
>> >
>> > but has different bulk read characteristics (faster for small bulks, 
>> > slower for large bulks, crossover-point at around 1k chars at once);
>> > (The actual values are of course also dependent on encoder/file contents, 
>> > those given here obtained with UTF-8 and a mostly/all ascii text file)
>> >
>> > [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
>> > readStream binary ) readStream.
>> >   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
>> 
>> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
>> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
>> some time ago. More specifically, they were not yet using 
>> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
>> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
>> assume most of the input will be ASCII (a reasonable assumption).
>> 
>> I am still chasing a bug, but right now:
>> 
>> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
>> binary))
>> next: 65536; close ] bench.
>> 
>> "135 per second.” BEFORE
>> "3,310 per second.” AFTER
>> 
>> But of course the input file is ASCII, so YMMV.
>> 
>> I’ll let you know when I commit this code.
>> 
>> Sven
>> 
>> > [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
>> >   ] bench  '176 per second.'
>> >
>> > If you use a StandardFileStream set to binary (which has less overhead 
>> > for binary #next sends compared to the MultiByteFileStream returned by 
>> > asFileReference readStream) as the base stream instead, the same general 
>> > profile holds true, but with a crossover around 2k characters.

Re: [Pharo-dev] Peek

2013-11-06 Thread Stéphane Ducasse
> 
> 
> BTW, do we still need UTF16 support ?
> 
> For those encodings that we still want to support in the future, we should 
> have new and more principled implementations under ZnCharacterEncoder. That 
> is, if we ever want to phase out TextConverter.

:)
You know what I think. 
Kill all the rotten code.

Stef




Re: [Pharo-dev] Peek

2013-11-06 Thread Stéphane Ducasse
Hi henrik

why don't you give it a try to change our lives and propose a new 
MultiByteFileStream and friends :)

Stef


> That's great!
> Remembering that commit message was part of the reason for benching, I was 
> sort of disappointed there was no significant difference between Zn in 2.0 and 
> the latest 3.0...
> 
> I guess with the amount of hacks accumulating, it is indeed turning into a 
> worthy successor of MultiByteFileStream ;)
> 
> Cheers,
> Henry
> 
> P.S: If you want another delightful one in the same vein (both from WTFy-ness 
> and perf improvement POV), take a gander at UTF16TextConverter >> 
> nextPutByteString:toStream: 
> 
> 
> On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe  wrote:
> Hi Henrik,
> 
> Great writeup, thanks !
> 
> (more inline)
> 
> On 04 Nov 2013, at 11:58, Henrik Johansen  
> wrote:
> 
> > On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
> >
> >> Working on Petit Delphi we found a strange implementation for 
> >> asPetitStream:
> >> Stream>>asPetitStream
> >>  ^ self contents asPetitStream
> >>
> >> Further investigation showed that the basic peek was not fast enough for 
> >> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
> >> peek":
> >> PPStream>>peek
> >>  "An improved version of peek, that is slightly faster than the built 
> >> in version."
> >>  ^ self atEnd ifFalse: [ collection at: position + 1 ]
> >>
> >> PPStream>>uncheckedPeek
> >>  "An unchecked version of peek that throws an error if we try to peek 
> >> over the end of the stream, even faster than #peek."
> >>  ^ collection at: position + 1
> >>
> >> But to my knowledge a basic peek should be fast. The real problem is the 
> >> peek in the underlying stream:
> >> PositionableStream>>peek
> >>  "Answer what would be returned if the message next were sent to the
> >>  receiver. If the receiver is at the end, answer nil."
> >>
> >>  | nextObject |
> >>  self atEnd ifTrue: [^nil].
> >>  nextObject := self next.
> >>  position := position - 1.
> >>  ^nextObject
> >>
> >> That actually uses "self next". The least thing one should do is to cache 
> >> the next object. But isn't there a primitive for peek in a file stream? 
> >> Because all overriding peeks of PositionableStream have basically the same 
> >> implementation: reading the next and restoring the state to before the 
> >> peek (that is slow). So we would like to be able to remove PPStream 
> >> without causing performance issues, as the only added method is the 
> >> "improved peek".
> >>
> >> Stephan and Diego
> >
> > If you are reading from file, ZnCharacterStream should be a valid 
> > alternative.
> > If not, ZnBufferedReadStream on an internal collection stream also does 
> > peek caching.
> >
> > Beware with files though; it’s better to bench the overall operation for 
> > different alternatives.
> > F.ex, ZnCharacterStream is much faster than the standard FileStream for 
> > peek:
> >
> > cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
> > binary.
> > [cr peek] bench. '49,400,000 per second.'
> > cr close.
> >
> > FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
> > per second.'
> >
> > but has different bulk read characteristics (faster for small bulks, 
> > slower for large bulks, crossover-point at around 1k chars at once);
> > (The actual values are of course also dependent on encoder/file contents, 
> > those given here obtained with UTF-8 and a mostly/all ascii text file)
> >
> > [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
> > readStream binary ) readStream.
> >   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
> 
> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
> some time ago. More specifically, they were not yet using 
> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
> assume most of the input will be ASCII (a reasonable assumption).
> 
> I am still chasing a bug, but right now:
> 
> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
> binary))
> next: 65536; close ] bench.
> 
> "135 per second.” BEFORE
> "3,310 per second.” AFTER
> 
> But of course the input file is ASCII, so YMMV.
> 
> I’ll let you know when I commit this code.
> 
> Sven
> 
> > [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
> >   ] bench  '176 per second.'
> >
> > If you use a StandardFileStream set to binary (which has less overhead for 
> > binary #next sends compared to the MultiByteFileStream returned by 
> > asFileReference readStream) as the base stream instead, the same general 
> > profile holds true, but with a crossover around 2k characters.
> >
> > TL;DR: Benchmark the alternatives. The best replacement option depends on 
> > your results. Appropriately (according to source and actual use) set up 
> > Zn-streams are probably your best bet.

Re: [Pharo-dev] Peek

2013-11-06 Thread Henrik Johansen
On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe  wrote:

> BTW, do we still need UTF16 support ?
> 
> For those encodings that we still want to support in the future, we should 
> have new and more principled implementations under ZnCharacterEncoder. That 
> is, if we ever want to phase out TextConverter.

UTF16 is the encoding for string arguments to the Unicode-aware versions of the 
Windows APIs, so I’d say yes.
Not too sure about JNI, but I know Java also uses UTF16 internally.

So the short answer is yes.
Not as a storage encoding, but for calling foreign functions, it’s quite 
important.

Cheers,
Henry





Re: [Pharo-dev] Peek

2013-11-06 Thread Henrik Johansen

On 04 Nov 2013, at 5:12 , Sven Van Caekenberghe  wrote:

> 
> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
> some time ago. More specifically, they were not yet using 
> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
> assume most of the input will be ASCII (a reasonable assumption).
> 
> I am still chasing a bug, but right now:
> 
> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
> binary))
>   next: 65536; close ] bench. 
> 
>   "135 per second.” BEFORE
>   "3,310 per second.” AFTER
> 
> But of course the input file is ASCII, so YMMV.
> 
> I’ll let you know when I commit this code.
> 
> Sven

Yeah… sooo, I loaded the updated version, great improvement for streams on 
Latin1 content :D

Maybe it’s just me, but I tested with actual wide source as well (it was as 
slow as you’d expect), and I think you need a notQuiteSoOptimizedReadInto* 
which uses the normal Byte -> Wide become: conversion machinery.
Writing a ZnByteStringBecameWideString handler for every next: (and cousins) 
call where source may be non-latin1 is a real chore/nasty surprise for those 
used to dealing with legacy streams/converters... 
You can sort of kinda make up for the performance hit (at least on these sizes) 
with a faster replace:from:to:with:startingAt: in use after the fact, by using 
basicAt:put: to the string, thus avoiding converting the replacement value to 
character. 
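
Something along these lines, to sketch the idea (not an actual method in the 
image; the receiver is assumed to be a WideString and the replacement a ByteString):

replaceFrom: start to: stop with: replacement startingAt: repStart
	"Copy raw code points with #basicAt:put:, skipping the per-element
	integer -> Character -> integer round trip."
	| offset |
	offset := repStart - start.
	start to: stop do: [ :index |
		self basicAt: index put: (replacement basicAt: index + offset) ]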

Cheers,
Henry

PS: Why is ZnByteStringBecameWideString a notification and not a resumable 
exception? I would assume those who run into it without a handler would rather 
have an actual error, than a result where their string has been read up to the 
first wide char...
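
(For reference, the default behaviours in question, in plain Pharo terms:

Notification new signal.	"unhandled: answers nil, execution continues"
Error new signal.	"unhandled: opens a debugger"
[ Error new signal ] on: Error do: [ :ex | ex return: #handled ].	"=> #handled"

An unhandled Notification silently resuming with nil is exactly how you end up 
with the string cut off at the first wide character.)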




Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 04 Nov 2013, at 22:25, Henrik Sperre Johansen  
wrote:

> That's great!
> Remembering that commit message was part of the reason for benching, I was 
> sort of disappointed there was no significant difference between Zn in 2.0 and 
> the latest 3.0...
> 
> I guess with the amount of hacks accumulating, it is indeed turning into a 
> worthy successor of MultiByteFileStream ;)

Bah, the hacking is limited to one method ( 
#optimizedReadInto:startingAt:count:fromStream: ) and the readable version is 
still present and used ( #originalReadInto:startingAt:count:fromStream: ). It 
is encapsulated and the rest of the code remains quite clear.

> Cheers,
> Henry
> 
> P.S: If you want another delightful one in the same vein (both from WTFy-ness 
> and perf improvement POV), take a gander at UTF16TextConverter >> 
> nextPutByteString:toStream: 

Pretty confusing !

BTW, do we still need UTF16 support ?

For those encodings that we still want to support in the future, we should have 
new and more principled implementations under ZnCharacterEncoder. That is, if 
we ever want to phase out TextConverter.

> On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe  wrote:
> Hi Henrik,
> 
> Great writeup, thanks !
> 
> (more inline)
> 
> On 04 Nov 2013, at 11:58, Henrik Johansen  
> wrote:
> 
> > On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
> >
> >> Working on Petit Delphi we found a strange implementation for 
> >> asPetitStream:
> >> Stream>>asPetitStream
> >>  ^ self contents asPetitStream
> >>
> >> Further investigation showed that the basic peek was not fast enough for 
> >> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
> >> peek":
> >> PPStream>>peek
> >>  "An improved version of peek, that is slightly faster than the built 
> >> in version."
> >>  ^ self atEnd ifFalse: [ collection at: position + 1 ]
> >>
> >> PPStream>>uncheckedPeek
> >>  "An unchecked version of peek that throws an error if we try to peek 
> >> over the end of the stream, even faster than #peek."
> >>  ^ collection at: position + 1
> >>
> >> But to my knowledge a basic peek should be fast. The real problem is the 
> >> peek in the underlying stream:
> >> PositionableStream>>peek
> >>  "Answer what would be returned if the message next were sent to the
> >>  receiver. If the receiver is at the end, answer nil."
> >>
> >>  | nextObject |
> >>  self atEnd ifTrue: [^nil].
> >>  nextObject := self next.
> >>  position := position - 1.
> >>  ^nextObject
> >>
> >> That actually uses "self next". The least thing one should do is to cache 
> >> the next object. But isn't there a primitive for peek in a file stream? 
> >> Because all overriding peeks of PositionableStream have basically the same 
> >> implementation: reading the next and restoring the state to before the 
> >> peek (that is slow). So we would like to be able to remove PPStream 
> >> without causing performance issues, as the only added method is the 
> >> "improved peek".
> >>
> >> Stephan and Diego
> >
> > If you are reading from file, ZnCharacterStream should be a valid 
> > alternative.
> > If not, ZnBufferedReadStream on an internal collection stream also does 
> > peek caching.
> >
> > Beware with files though; it’s better to bench the overall operation for 
> > different alternatives.
> > F.ex, ZnCharacterStream is much faster than the standard FileStream for 
> > peek:
> >
> > cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
> > binary.
> > [cr peek] bench. '49,400,000 per second.'
> > cr close.
> >
> > FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
> > per second.'
> >
> > but has different bulk read characteristics (faster for small bulks, 
> > slower for large bulks, crossover-point at around 1k chars at once);
> > (The actual values are of course also dependent on encoder/file contents, 
> > those given here obtained with UTF-8 and a mostly/all ascii text file)
> >
> > [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
> > readStream binary ) readStream.
> >   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
> 
> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
> some time ago. More specifically, they were not yet using 
> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
> assume most of the input will be ASCII (a reasonable assumption).
> 
> I am still chasing a bug, but right now:
> 
> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
> binary))
> next: 65536; close ] bench.
> 
> "135 per second.” BEFORE
> "3,310 per second.” AFTER
> 
> But of course the input file is ASCII, so YMMV.
> 
> I’ll let you know when I commit this code.
> 
> Sven
> 
> > [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
> >   ] bench  '176 per second.'

Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 04 Nov 2013, at 17:12, Sven Van Caekenberghe  wrote:

> Hi Henrik,
> 
> Great writeup, thanks !
> 
> (more inline)
> 
> On 04 Nov 2013, at 11:58, Henrik Johansen  
> wrote:
> 
>> On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
>> 
>>> Working on Petit Delphi we found a strange implementation for asPetitStream:
>>> Stream>>asPetitStream
>>> ^ self contents asPetitStream
>>> 
>>> Further investigation showed that the basic peek was not fast enough for 
>>> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
>>> peek":
>>> PPStream>>peek
>>> "An improved version of peek, that is slightly faster than the built in 
>>> version."
>>> ^ self atEnd ifFalse: [ collection at: position + 1 ]
>>> 
>>> PPStream>>uncheckedPeek
>>> "An unchecked version of peek that throws an error if we try to peek 
>>> over the end of the stream, even faster than #peek."
>>> ^ collection at: position + 1
>>> 
>>> But to my knowledge a basic peek should be fast. The real problem is the 
>>> peek in the underlying stream:
>>> PositionableStream>>peek
>>> "Answer what would be returned if the message next were sent to the 
>>> receiver. If the receiver is at the end, answer nil."
>>> 
>>> | nextObject |
>>> self atEnd ifTrue: [^nil].
>>> nextObject := self next.
>>> position := position - 1.
>>> ^nextObject
>>> 
>>> That actually uses "self next". The least thing one should do is to cache 
>>> the next object. But isn't there a primitive for peek in a file stream? 
>>> Because all overriding peeks of PositionableStream have basically the same 
>>> implementation: reading the next and restoring the state to before the peek 
>>> (that is slow). So we would like to be able to remove PPStream without 
>>> causing performance issues, as the only added method is the "improved peek".
>>> 
>>> Stephan and Diego
>> 
>> If you are reading from file, ZnCharacterStream should be a valid 
>> alternative.
>> If not, ZnBufferedReadStream on an internal collection stream also does peek 
>> caching.
>> 
>> Beware with files though; it’s better to bench the overall operation for 
>> different alternatives.
>> F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:
>> 
>> cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
>> binary.
>> [cr peek] bench. '49,400,000 per second.'
>> cr close.
>> 
>> FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
>> per second.'
>> 
>> but has different bulk read characteristics (faster for small bulks, slower 
>> for large bulks, crossover-point at around 1k chars at once);
>> (The actual values are of course also dependent on encoder/file contents, 
>> those given here obtained with UTF-8 and a mostly/all ascii text file)
>> 
>> [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
>> readStream binary ) readStream.
>>  cr next: 65536; close] bench  '105 per second.'  '106 per second.'
> 
> Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
> did not yet make use of the optimisations that I did for ZnCharacterEncoding 
> some time ago. More specifically, they were not yet using 
> #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
> which are overridden for ZnUTF8Encoder with (super hacked) versions that 
> assume most of the input will be ASCII (a reasonable assumption).
> 
> I am still chasing a bug, but right now:
> 
> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
> binary))
>   next: 65536; close ] bench. 
> 
>   "135 per second.” BEFORE
>   "3,310 per second.” AFTER
> 
> But of course the input file is ASCII, so YMMV.
> 
> I’ll let you know when I commit this code.

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.26
Author: SvenVanCaekenberghe
Time: 6 November 2013, 10:09:58.837656 am
UUID: dcb21b30-9596-446d-801c-b091cf3f415e
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.25

ZnCharacterReadStream and ZnCharacterWriteStream now use the optimized bulk 
reading/writing methods from ZnCharacterEncoder 
(#readInto:startingAt:count:fromStream: and #next:putAll:startingAt:toStream: 
respectively)
Fixed a bug in ZnUTF8Encoder>>#optimizedReadInto:startingAt:count:fromStream:
Some refactoring and cleanup of ZnUTF8Encoder and ZnByteEncoder
Added 2 more unit tests 
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.13
Author: SvenVanCaekenberghe
Time: 6 November 2013, 10:11:11.411621 am
UUID: 71b503eb-548e-4e2f-b80d-4d2b805bd11e
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.12

Added 2 more unit tests 
ZnCharacterReadStream and ZnCharacterWriteStream now use the optimized bulk 
reading/writing methods from ZnCharacterEncoder 
(#readInto:startingAt:count:fromStream: and #next:putAll:startingAt:toStream: 
respectively)
Fixed a bug in ZnUTF8Encoder>>#optimizedReadInto:startingAt:count:fromStream:
Some refactoring and cleanup of ZnUTF8Encoder and ZnByteEncoder
===
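
A quick way to exercise the bulk decoding entry point mentioned above (a sketch 
with all-ASCII input; #encodeString: assumed from the encoder API):

| encoder input buffer |
encoder := ZnUTF8Encoder new.
input := (encoder encodeString: 'Hello Pharo') readStream.
buffer := String new: 11.
encoder readInto: buffer startingAt: 1 count: 11 fromStream: input.
buffer
"=> 'Hello Pharo'"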

Re: [Pharo-dev] Peek

2013-11-04 Thread Henrik Sperre Johansen
That's great!
Remembering that commit message was part of the reason for benching, I was
sort of disappointed there was no significant difference between Zn in 2.0
and the latest 3.0...

I guess with the amount of hacks accumulating, it is indeed turning into a
worthy successor of MultiByteFileStream ;)

Cheers,
Henry

P.S: If you want another delightful one in the same vein (both from
WTFy-ness and perf improvement POV), take a gander at UTF16TextConverter >>
nextPutByteString:toStream:


On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe  wrote:

> Hi Henrik,
>
> Great writeup, thanks !
>
> (more inline)
>
> On 04 Nov 2013, at 11:58, Henrik Johansen 
> wrote:
>
> > On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
> >
> >> Working on Petit Delphi we found a strange implementation for
> asPetitStream:
> >> Stream>>asPetitStream
> >>  ^ self contents asPetitStream
> >>
> >> Further investigation showed that the basic peek was not fast enough
> for Petit Parser, as it is used a lot. So it implemented an "improved
> unchecked peek":
> >> PPStream>>peek
> >>  "An improved version of peek, that is slightly faster than the
> built in version."
> >>  ^ self atEnd ifFalse: [ collection at: position + 1 ]
> >>
> >> PPStream>>uncheckedPeek
> >>  "An unchecked version of peek that throws an error if we try to
> peek over the end of the stream, even faster than #peek."
> >>  ^ collection at: position + 1
> >>
> >> But to my knowledge a basic peek should be fast. The real problem is
> the peek in the underlying stream:
> >> PositionableStream>>peek
> >>  "Answer what would be returned if the message next were sent to the
> >>  receiver. If the receiver is at the end, answer nil."
> >>
> >>  | nextObject |
> >>  self atEnd ifTrue: [^nil].
> >>  nextObject := self next.
> >>  position := position - 1.
> >>  ^nextObject
> >>
> >> That actually uses "self next". The least thing one should do is to
> cache the next object. But isn't there a primitive for peek in a file
> stream? Because all overriding peeks of PositionableStream have basically
> the same implementation: reading the next and restoring the state to before
> the peek (that is slow). So we would like to be able to remove PPStream
> without causing performance issues, as the only added method is the
> "improved peek".
> >>
> >> Stephan and Diego
> >
> > If you are reading from file, ZnCharacterStream should be a valid
> alternative.
> > If not, ZnBufferedReadStream on an internal collection stream also does
> peek caching.
> >
> > Beware with files though; it’s better to bench the overall operation for
> different alternatives.
> > F.ex, ZnCharacterStream is much faster than the standard FileStream for
> peek:
> >
> > cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference
> readStream binary.
> > [cr peek] bench. '49,400,000 per second.'
> > cr close.
> >
> > FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench]
> '535,000 per second.'
> >
> > but has different bulk read characteristics (faster for small bulks,
> slower for large bulks, crossover-point at around 1k chars at once);
> > (The actual values are of course also dependent on encoder/file
> contents, those given here obtained with UTF-8 and a mostly/all ascii text
> file)
> >
> > [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference
> readStream binary ) readStream.
> >   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
>
> Well, I just realised that ZnCharacterReadStream and
> ZnCharacterWriteStream did not yet make use of the optimisations that I did
> for ZnCharacterEncoding some time ago. More specifically, they were not yet
> using #next:putAll:startingAt:toStream: and
> #readInto:startingAt:count:fromStream: which are overridden for
> ZnUTF8Encoder with (super hacked) versions that assume most of the input
> will be ASCII (a reasonable assumption).
>
> I am still chasing a bug, but right now:
>
> [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream
> binary))
> next: 65536; close ] bench.
>
> "135 per second.” BEFORE
> "3,310 per second.” AFTER
>
> But of course the input file is ASCII, so YMMV.
>
> I’ll let you know when I commit this code.
>
> Sven
>
> > [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
> >   ] bench  '176 per second.'
> >
> > If you use a StandardFileStream set to binary (which has less overhead
> for binary #next sends compared to the MultiByteFileStream returned by
> asFileReference readStream) as the base stream instead, the same general
> profile holds true, but with a crossover around 2k characters.
> >
> > TL;DR: Benchmark the alternatives. The best replacement option depends
> on your results. Appropriately (according to source and actual use) set up
> Zn-streams are probably your best bet.
> >
> > Cheers,
> > Henry
>
>
>


Re: [Pharo-dev] Peek

2013-11-04 Thread Sven Van Caekenberghe
Hi Henrik,

Great writeup, thanks !

(more inline)

On 04 Nov 2013, at 11:58, Henrik Johansen  wrote:

> On 04 Nov 2013, at 9:57 , Diego Lont  wrote:
> 
>> Working on Petit Delphi we found a strange implementation for asPetitStream:
>> Stream>>asPetitStream
>>  ^ self contents asPetitStream
>> 
>> Further investigation showed that the basic peek was not fast enough for 
>> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
>> peek":
>> PPStream>>peek
>>  "An improved version of peek, that is slightly faster than the built in 
>> version."
>>  ^ self atEnd ifFalse: [ collection at: position + 1 ]
>> 
>> PPStream>>uncheckedPeek
>>  "An unchecked version of peek that throws an error if we try to peek 
>> over the end of the stream, even faster than #peek."
>>  ^ collection at: position + 1
>> 
>> But to my knowledge a basic peek should be fast. The real problem is the 
>> peek in the underlying stream:
>> PositionableStream>>peek
>>  "Answer what would be returned if the message next were sent to the 
>>  receiver. If the receiver is at the end, answer nil."
>> 
>>  | nextObject |
>>  self atEnd ifTrue: [^nil].
>>  nextObject := self next.
>>  position := position - 1.
>>  ^nextObject
>> 
>> That actually uses "self next". The least thing one should do is to cache 
>> the next object. But isn't there a primitive for peek in a file stream? 
>> Because all overriding peeks of PositionableStream have basically the same 
>> implementation: reading the next and restoring the state to before the peek 
>> (that is slow). So we would like to be able to remove PPStream without 
>> causing performance issues, as the only added method is the "improved peek".
>> 
>> Stephan and Diego
> 
> If you are reading from file, ZnCharacterStream should be a valid alternative.
> If not, ZnBufferedReadStream on an internal collection stream also does peek 
> caching.
> 
> Beware with files though; it’s better to bench the overall operation for 
> different alternatives.
> F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:
> 
> cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
> binary.
> [cr peek] bench. '49,400,000 per second.'
> cr close.
> 
> FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
> per second.'
> 
> but has different bulk read characteristics (faster for small bulks, slower 
> for large bulks, crossover-point at around 1k chars at once);
> (The actual values are of course also dependent on encoder/file contents, 
> those given here obtained with UTF-8 and a mostly/all ascii text file)
> 
> [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference readStream 
> binary ) readStream.
>   cr next: 65536; close] bench  '105 per second.'  '106 per second.'

Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream did 
not yet make use of the optimisations that I did for ZnCharacterEncoding some 
time ago. More specifically, they were not yet using 
#next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
which are overridden for ZnUTF8Encoder with (super hacked) versions that 
assume most of the input will be ASCII (a reasonable assumption).

I am still chasing a bug, but right now:

[ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
binary))
next: 65536; close ] bench. 

"135 per second.” BEFORE
"3,310 per second.” AFTER

But of course the input file is ASCII, so YMMV.

I’ll let you know when I commit this code.

Sven

> [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
>   ] bench  '176 per second.'
> 
> If you use a StandardFileStream set to binary (which has less overhead for 
> binary #next sends compared to the MultiByteFileStream returned by asFileReference 
> readStream) as the base stream instead, the same general profile holds true, 
> but with a crossover around 2k characters.
> 
> TL;DR: Benchmark the alternatives. The best replacement option depends on 
> your results. Appropriately (according to source and actual use) set up 
> Zn-streams are probably your best bet. 
> 
> Cheers,
> Henry




Re: [Pharo-dev] Peek

2013-11-04 Thread Henrik Johansen

On 04 Nov 2013, at 9:57 , Diego Lont  wrote:

> Working on Petit Delphi we found a strange implementation for asPetitStream:
> Stream>>asPetitStream
>   ^ self contents asPetitStream
> 
> Further investigation showed that the basic peek was not fast enough for 
> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
> peek":
> PPStream>>peek
>   "An improved version of peek, that is slightly faster than the built in 
> version."
>   ^ self atEnd ifFalse: [ collection at: position + 1 ]
> 
> PPStream>>uncheckedPeek
>   "An unchecked version of peek that throws an error if we try to peek 
> over the end of the stream, even faster than #peek."
>   ^ collection at: position + 1
> 
> But to my knowledge a basic peek should be fast. The real problem is the peek 
> in the underlying stream:
> PositionableStream>>peek
>   "Answer what would be returned if the message next were sent to the 
>   receiver. If the receiver is at the end, answer nil."
> 
>   | nextObject |
>   self atEnd ifTrue: [^nil].
>   nextObject := self next.
>   position := position - 1.
>   ^nextObject
> 
> That actually uses "self next". The least thing one should do is to cache the 
> next object. But isn't there a primitive for peek in a file stream? Because 
> all overriding peeks of PositionableStream have basically the same 
> implementation: reading the next and restoring the state to before the peek 
> (that is slow). So we would like to be able to remove PPStream without 
> causing performance issues, as the only added method is the "improved peek".
> 
> Stephan and Diego

If you are reading from file, ZnCharacterStream should be a valid alternative.
If not, ZnBufferedReadStream on an internal collection stream also does peek 
caching.
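
F.ex (a minimal sketch):

(ZnBufferedReadStream on: 'Hello' readStream) peek.
"=> $H, buffered, without advancing the underlying stream"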

Beware with files though; it’s better to bench the overall operation for 
different alternatives.
F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:

cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
binary.
[cr peek] bench. '49,400,000 per second.'
cr close.

FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 per 
second.'

but has different bulk read characteristics (faster for small bulks, slower 
for large bulks, crossover-point at around 1k chars at once);
(The actual values are of course also dependent on encoder/file contents, those 
given here obtained with UTF-8 and a mostly/all ascii text file)

[cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference readStream 
binary ) readStream.
cr next: 65536; close] bench  '105 per second.'  '106 per second.'

[FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
] bench  '176 per second.'

If you use a StandardFileStream set to binary (which has less overhead for 
binary #next sends compared to the MultiByteFileStream returned by asFileReference 
readStream) as the base stream instead, the same general profile holds true, 
but with a crossover around 2k characters.

TL;DR: Benchmark the alternatives. The best replacement option depends on your 
results. Appropriately (according to source and actual use) set up Zn-streams 
are probably your best bet. 
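
For example, one such setup for file reading (a sketch, buffering the binary 
stream underneath the decoder):

| stream |
stream := ZnCharacterReadStream on:
	(ZnBufferedReadStream on: 'PharoDebug.log' asFileReference readStream binary).
[ stream next: 65536 ] ensure: [ stream close ]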

Cheers,
Henry




Re: [Pharo-dev] Peek

2013-11-04 Thread Sven Van Caekenberghe
Hi Diego,

Streams as they are currently defined, especially ReadStreams, can hardly be 
called IO streams. Many operations assume that there is a single collection 
over which you can move in any direction without restrictions. As a consequence 
most parsing code relies on this ability, looking forward and going back multiple 
times, like using #skip: with a negative argument.

Real IO ReadStreams like SocketStreams or FileStreams cannot be assumed to 
keep more than just a buffer in memory, hence the unrestricted operations are 
not implementable.

IMO, parsing code should be written against a much more restricted API, 
assuming at most a one-element peek/buffer. I always try to do that; I even 
have Mock ReadStreams that ensure this.
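
Such a mock can be as small as a wrapper that understands only the restricted 
protocol and fails fast on anything positional (a sketch, all names made up):

Object subclass: #RestrictedReadStream
	instanceVariableNames: 'stream'
	classVariableNames: ''
	category: 'Sandbox-Streams'

RestrictedReadStream class>>on: aReadStream
	^ self new setStream: aReadStream

RestrictedReadStream>>setStream: aReadStream
	stream := aReadStream

RestrictedReadStream>>next
	^ stream next

RestrictedReadStream>>peek
	^ stream peek

RestrictedReadStream>>atEnd
	^ stream atEnd

RestrictedReadStream>>position: anInteger
	self error: 'parsing code should not position this stream'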

While benchmarking I have also found that slow #peek behaviour can be a 
bottleneck, especially on some of the more complex streams.

Sven

PS: Have a look at ZnCharacterReadStream for example, which implements this 
restricted behaviour: it is not positionable, but allows a one-character peek 
using a one-character internal buffer.
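
In essence (a sketch of the technique, not the actual Zn code; the instance 
variable and helper selector are illustrative):

peek
	"Answer what #next would return, without consuming it."
	^ buffer ifNil: [ buffer := self readNextCharacter ]

next
	"Answer the buffered character from a preceding #peek, else read ahead."
	| character |
	character := buffer ifNil: [ self readNextCharacter ] ifNotNil: [ buffer ].
	buffer := nil.
	^ character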

On 04 Nov 2013, at 09:57, Diego Lont  wrote:

> Working on Petit Delphi we found a strange implementation for asPetitStream:
> Stream>>asPetitStream
>   ^ self contents asPetitStream
> 
> Further investigation showed that the basic peek was not fast enough for 
> Petit Parser, as it is used a lot. So it implemented an "improved unchecked 
> peek":
> PPStream>>peek
>   "An improved version of peek, that is slightly faster than the built in 
> version."
>   ^ self atEnd ifFalse: [ collection at: position + 1 ]
> 
> PPStream>>uncheckedPeek
>   "An unchecked version of peek that throws an error if we try to peek 
> over the end of the stream, even faster than #peek."
>   ^ collection at: position + 1
> 
> But to my knowledge a basic peek should be fast. The real problem is the peek 
> in the underlying stream:
> PositionableStream>>peek
>   "Answer what would be returned if the message next were sent to the 
>   receiver. If the receiver is at the end, answer nil."
> 
>   | nextObject |
>   self atEnd ifTrue: [^nil].
>   nextObject := self next.
>   position := position - 1.
>   ^nextObject
> 
> That actually uses "self next". The least thing one should do is to cache the 
> next object. But isn't there a primitive for peek in a file stream? Because 
> all overriding peeks of PositionableStream have basically the same 
> implementation: reading the next and restoring the state to before the peek 
> (that is slow). So we would like to be able to remove PPStream without 
> causing performance issues, as the only added method is the "improved peek".
> 
> Stephan and Diego