Re: [Pharo-dev] Peek

2013-11-08 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:59, Sven Van Caekenberghe s...@stfx.eu wrote:

 
 On 06 Nov 2013, at 13:29, Henrik Johansen henrik.s.johan...@veloxit.no 
 wrote:
 
 On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe s...@stfx.eu wrote:
 
 BTW, do we still need UTF16 support ?
 
 For those encodings that we still want to support in the future, we should 
 have new and more principled implementations under ZnCharacterEncoder. That 
 is, if we ever want to phase out TextConverter.
 
 UTF16 is the encoding for string arguments to the Unicode-aware versions of 
 the Windows APIs, so I’d say yes.
 Not too sure about JNI, but I know Java also uses UTF16 internally.
 
 So the short answer is yes.
 Not as a storage encoding, but for calling foreign functions, it’s quite 
 important.
 
 Ah, OK.
 
 Fancy giving it a try, to create a ‘new’ implementation under 
 ZnCharacterEncoder with some tests ?
 
 There are only 3 abstract methods to implement; if you already know 
 UTF16, it would not be too much work ;-)

A first attempt:

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.27
Author: SvenVanCaekenberghe
Time: 8 November 2013, 4:18:07.642898 pm
UUID: 29824770-7b9d-4377-a934-7bb2fbeefefb
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.26

Added new ZnUTF16Encoder
Minor refactoring to ZnCharacterEncoder hierarchy
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.14
Author: SvenVanCaekenberghe
Time: 8 November 2013, 4:18:53.717919 pm
UUID: 6309f553-1632-438d-825c-b7a0f89193f4
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.13

Added unit tests for the new ZnUTF16Encoder
===
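For anyone wanting to try the new encoder before reading the tests, a minimal workspace round trip might look like this. This is a sketch: it assumes ZnUTF16Encoder plugs into the generic #encodeString: / #decodeBytes: API of ZnCharacterEncoder and writes no byte order mark by default.

| encoder bytes |
encoder := ZnUTF16Encoder new.
bytes := encoder encodeString: 'élève'.
"5 BMP characters at 2 bytes each, assuming no BOM is written"
bytes size.                    "=> 10"
encoder decodeBytes: bytes.    "=> 'élève'"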

 Cheers,
 Henry




Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 04 Nov 2013, at 17:12, Sven Van Caekenberghe s...@stfx.eu wrote:

 Hi Henrik,
 
 Great writeup, thanks !
 
 (more inline)
 
 On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no 
 wrote:
 
 On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
 Working on Petit Delphi we found a strange implementation for asPetitStream:
 Stream>>asPetitStream
 ^ self contents asPetitStream
 
 Further investigation showed that the basic peek was not fast enough for 
 Petit Parser, as it is used a lot. So it implemented an improved unchecked 
 peek:
 PPStream>>peek
 "An improved version of peek, that is slightly faster than the built in 
 version."
 ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
 PPStream>>uncheckedPeek
 "An unchecked version of peek that throws an error if we try to peek 
 over the end of the stream, even faster than #peek."
 ^ collection at: position + 1
 
 But to my knowledge a basic peek should be fast. The real problem is the 
 peek in the underlying stream:
 PositionableStream>>peek
 "Answer what would be returned if the message next were sent to the 
 receiver. If the receiver is at the end, answer nil."
 
 | nextObject |
 self atEnd ifTrue: [^nil].
 nextObject := self next.
 position := position - 1.
 ^nextObject
 
 That actually uses self next. The least one should do is to cache 
 the next object. But isn't there a primitive for peek in a file stream? 
 Because all overriding peeks of PositionableStream have basically the same 
 implementation: reading the next and restoring the state to before the peek 
 (that is slow). So we would like to be able to remove PPStream without 
 causing performance issues, as the only added method is the improved peek.
 
 Stephan and Diego
 
 If you are reading from file, ZnCharacterStream should be a valid 
 alternative.
 If not, ZnBufferedReadStream on an internal collection stream also does peek 
 caching.
 
 Beware with files though; it’s better to bench the overall operation for 
 different alternatives.
 F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:
 
 cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
 binary.
 [cr peek] bench. '49,400,000 per second.'
 cr close.
 
 FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
 per second.'
 
 but has different bulk read characteristics (faster for small bulks, slower 
 for large bulks, crossover point at around 1k chars at once);
 (The actual values are of course also dependent on encoder/file contents, 
 those given here obtained with UTF-8 and a mostly/all ASCII text file)
 
 [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
 readStream binary ) readStream.
 cr next: 65536; close] bench  '105 per second.'  '106 per second.'
 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
 which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
   next: 65536; close ] bench. 
 
   "135 per second." BEFORE
   "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.26
Author: SvenVanCaekenberghe
Time: 6 November 2013, 10:09:58.837656 am
UUID: dcb21b30-9596-446d-801c-b091cf3f415e
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.25

ZnCharacterReadStream and ZnCharacterWriteStream now use the optimized bulk 
reading/writing methods from ZnCharacterEncoder 
(#readInto:startingAt:count:fromStream: and #next:putAll:startingAt:toStream: 
respectively)
Fixed a bug in ZnUTF8Encoder#optimizedReadInto:startingAt:count:fromStream:
Some refactoring and cleanup of ZnUTF8Encoder and ZnByteEncoder
Added 2 more unit tests 
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.13
Author: SvenVanCaekenberghe
Time: 6 November 2013, 10:11:11.411621 am
UUID: 71b503eb-548e-4e2f-b80d-4d2b805bd11e
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.12

Added 2 more unit tests 
ZnCharacterReadStream and ZnCharacterWriteStream now use the optimized bulk 
reading/writing methods from ZnCharacterEncoder 
(#readInto:startingAt:count:fromStream: and #next:putAll:startingAt:toStream: 
respectively)
Fixed a bug in ZnUTF8Encoder#optimizedReadInto:startingAt:count:fromStream:
Some refactoring and cleanup of ZnUTF8Encoder and ZnByteEncoder
===
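The idea behind the optimized bulk read is simple: as long as the next byte is below 128 it is a plain ASCII character and can be stored directly, and only the first non-ASCII byte forces a fall back to full UTF-8 decoding. A simplified sketch of that fast path follows; this is not the actual ZnUTF8Encoder code, and #decodeContinuationOf:fromStream: is a hypothetical stand-in for the real multi-byte handling.

readInto: aString startingAt: offset count: requestedCount fromStream: stream
    "Sketch of an ASCII fast path for bulk decoding:
    copy bytes below 128 straight in, delegate everything else."
    0 to: requestedCount - 1 do: [ :index |
        | byte |
        stream atEnd ifTrue: [ ^ index ].
        byte := stream next.
        aString
            at: offset + index
            put: (byte < 128
                ifTrue: [ Character value: byte ]
                ifFalse: [ self decodeContinuationOf: byte fromStream: stream ]) ].
    ^ requestedCount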

 Sven
 
 [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
  ] bench  '176 per second.'
 
 If you use a 

Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 04 Nov 2013, at 22:25, Henrik Sperre Johansen henrik.s.johan...@veloxit.no 
wrote:

 That's great!
 Remembering that commit message was part of the reason for benching, I was sort 
 of disappointed there was no significant difference between Zn in 2.0 and 
 latest 3.0...
 
 I guess with the amount of hacks accumulating, it is indeed turning into a 
 worthy successor of MultiByteFileStream ;)

Bah, the hacking is limited to one method ( 
#optimizedReadInto:startingAt:count:fromStream: ) and the readable version is 
still present and used ( #originalReadInto:startingAt:count:fromStream: ). It 
is encapsulated and the rest of the code remains quite clear.

 Cheers,
 Henry
 
 P.S: If you want another delightful one in the same vein (both from WTFy-ness 
 and perf improvement POV), take a gander at UTF16TextConverter >> 
 nextPutByteString:toStream: 

Pretty confusing !

BTW, do we still need UTF16 support ?

For those encodings that we still want to support in the future, we should have 
new and more principled implementations under ZnCharacterEncoder. That is, if 
we ever want to phase out TextConverter.

 On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe s...@stfx.eu wrote:
 Hi Henrik,
 
 Great writeup, thanks !
 
 (more inline)
 
 On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no 
 wrote:
 
  On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
  Working on Petit Delphi we found a strange implementation for 
  asPetitStream:
  Stream>>asPetitStream
   ^ self contents asPetitStream
 
  Further investigation showed that the basic peek was not fast enough for 
  Petit Parser, as it is used a lot. So it implemented an improved unchecked 
  peek:
  PPStream>>peek
   "An improved version of peek, that is slightly faster than the built 
  in version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
  PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to peek 
  over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
  But to my knowledge a basic peek should be fast. The real problem is the 
  peek in the underlying stream:
  PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
  That actually uses self next. The least one should do is to cache 
  the next object. But isn't there a primitive for peek in a file stream? 
  Because all overriding peeks of PositionableStream have basically the same 
  implementation: reading the next and restoring the state to before the 
  peek (that is slow). So we would like to be able to remove PPStream 
  without causing performance issues, as the only added method is the 
  improved peek.
 
  Stephan and Diego
 
  If you are reading from file, ZnCharacterStream should be a valid 
  alternative.
  If not, ZnBufferedReadStream on an internal collection stream also does 
  peek caching.
 
  Beware with files though; it’s better to bench the overall operation for 
  different alternatives.
  F.ex, ZnCharacterStream is much faster than the standard FileStream for 
  peek:
 
  cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
  binary.
  [cr peek] bench. '49,400,000 per second.'
  cr close.
 
  FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
  per second.'
 
  but has different bulk read characteristics (faster for small bulks, 
  slower for large bulks, crossover point at around 1k chars at once);
  (The actual values are of course also dependent on encoder/file contents, 
  those given here obtained with UTF-8 and a mostly/all ASCII text file)
 
  [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
  readStream binary ) readStream.
   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
  which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
 next: 65536; close ] bench.
 
  "135 per second." BEFORE
  "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.
 
 Sven
 
  [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
   ] bench  '176 per second.'
 
  If you use a StandardFileStream set to binary ( which has 

Re: [Pharo-dev] Peek

2013-11-06 Thread Henrik Johansen

On 04 Nov 2013, at 5:12 , Sven Van Caekenberghe s...@stfx.eu wrote:

 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
 which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
   next: 65536; close ] bench. 
 
   "135 per second." BEFORE
   "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.
 
 Sven

Yeah… sooo, I loaded the updated version, great improvement for streams on 
Latin1 content :D

Maybe it’s just me, but I tested with actual wide source as well (it was as 
slow as you’d expect), and I think you need a notQuiteSoOptimizedReadInto* 
which uses the normal Byte -> Wide become: conversion machinery.
Writing a ZnByteStringBecameWideString handler for every next: (and cousins) 
call where source may be non-latin1 is a real chore/nasty surprise for those 
used to dealing with legacy streams/converters... 
You can sort of kinda make up for the performance hit (at least on these sizes) 
with a faster replace:from:to:with:startingAt: used after the fact, by using 
basicAt:put: on the string, thus avoiding converting the replacement value to a 
Character. 

Cheers,
Henry

PS: Why is ZnByteStringBecameWideString a notification and not a resumable 
exception? I would assume those who run into it without a handler would rather 
have an actual error, than a result where their string has been read up to the 
first wide char...
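For readers unfamiliar with the distinction in the PS: an unhandled Notification silently resumes and answers nil, while an unhandled Error actually stops execution. A generic illustration of the difference, using plain Pharo exception behaviour rather than anything Zinc-specific:

"An unhandled Notification resumes with nil, so decoding just continues"
Notification signal: 'byte string became wide'.    "=> nil, no debugger"

"An unhandled Error stops execution; a handler can still supply a value"
[ Error signal: 'byte string became wide' ]
    on: Error
    do: [ :ex | ex return: #handled ].             "=> #handled"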




Re: [Pharo-dev] Peek

2013-11-06 Thread Henrik Johansen
On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe s...@stfx.eu wrote:

 BTW, do we still need UTF16 support ?
 
 For those encodings that we still want to support in the future, we should 
 have new and more principled implementations under ZnCharacterEncoder. That 
 is, if we ever want to phase out TextConverter.

UTF16 is the encoding for string arguments to the Unicode-aware versions of the 
Windows APIs, so I’d say yes.
Not too sure about JNI, but I know Java also uses UTF16 internally.

So the short answer is yes.
Not as a storage encoding, but for calling foreign functions, it’s quite 
important.

Cheers,
Henry





Re: [Pharo-dev] Peek

2013-11-06 Thread Stéphane Ducasse
Hi henrik

why don't you give it a try to change our lives and propose a new 
MultiByteFileStream and friends :)

Stef


 That's great!
 Remembering that commit message was part of the reason for benching, I was sort 
 of disappointed there was no significant difference between Zn in 2.0 and 
 latest 3.0...
 
 I guess with the amount of hacks accumulating, it is indeed turning into a 
 worthy successor of MultiByteFileStream ;)
 
 Cheers,
 Henry
 
 P.S: If you want another delightful one in the same vein (both from WTFy-ness 
 and perf improvement POV), take a gander at UTF16TextConverter >> 
 nextPutByteString:toStream: 
 
 
 On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe s...@stfx.eu wrote:
 Hi Henrik,
 
 Great writeup, thanks !
 
 (more inline)
 
 On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no 
 wrote:
 
  On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
  Working on Petit Delphi we found a strange implementation for 
  asPetitStream:
  Stream>>asPetitStream
   ^ self contents asPetitStream
 
  Further investigation showed that the basic peek was not fast enough for 
  Petit Parser, as it is used a lot. So it implemented an improved unchecked 
  peek:
  PPStream>>peek
   "An improved version of peek, that is slightly faster than the built 
  in version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
  PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to peek 
  over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
  But to my knowledge a basic peek should be fast. The real problem is the 
  peek in the underlying stream:
  PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
  That actually uses self next. The least one should do is to cache 
  the next object. But isn't there a primitive for peek in a file stream? 
  Because all overriding peeks of PositionableStream have basically the same 
  implementation: reading the next and restoring the state to before the 
  peek (that is slow). So we would like to be able to remove PPStream 
  without causing performance issues, as the only added method is the 
  improved peek.
 
  Stephan and Diego
 
  If you are reading from file, ZnCharacterStream should be a valid 
  alternative.
  If not, ZnBufferedReadStream on an internal collection stream also does 
  peek caching.
 
  Beware with files though; it’s better to bench the overall operation for 
  different alternatives.
  F.ex, ZnCharacterStream is much faster than the standard FileStream for 
  peek:
 
  cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
  binary.
  [cr peek] bench. '49,400,000 per second.'
  cr close.
 
  FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
  per second.'
 
  but has different bulk read characteristics (faster for small bulks, 
  slower for large bulks, crossover point at around 1k chars at once);
  (The actual values are of course also dependent on encoder/file contents, 
  those given here obtained with UTF-8 and a mostly/all ASCII text file)
 
  [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
  readStream binary ) readStream.
   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
 which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
 next: 65536; close ] bench.
 
 "135 per second." BEFORE
 "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.
 
 Sven
 
  [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
   ] bench  '176 per second.'
 
  If you use a StandardFileStream set to binary (which has less overhead for 
  binary next’s compared to the MultiByteFileStream returned by 
  asFileReference readStream) as the base stream instead, the same general 
  profile holds true, but with a crossover around 2k characters.
 
  TL;DR: Benchmark the alternatives. The best replacement option depends on 
  your results. Appropriately (according to source and actual use) set up 
  Zn-streams are probably your best bet.
 
  Cheers,
  Henry
 
 
 



Re: [Pharo-dev] Peek

2013-11-06 Thread Stéphane Ducasse
 
 
 BTW, do we still need UTF16 support ?
 
 For those encodings that we still want to support in the future, we should 
 have new and more principled implementations under ZnCharacterEncoder. That 
 is, if we ever want to phase out TextConverter.

:)
You know what I think. 
Kill all the rotten code.

Stef




Re: [Pharo-dev] Peek

2013-11-06 Thread Camillo Bruni
that would be called XStreams, no?

On 2013-11-06, at 13:43, Stéphane Ducasse stephane.duca...@inria.fr wrote:

 Hi henrik
 
 why don't you give it a try to change our lives and propose a new 
 MultiByteFileStream and friends :)
 
 Stef
 
 
 That's great!
 Remembering that commit message was part of the reason for benching, I was 
 sort of disappointed there was no significant difference between Zn in 2.0 
 and latest 3.0...
 
 I guess with the amount of hacks accumulating, it is indeed turning into a 
 worthy successor of MultiByteFileStream ;)
 
 Cheers,
 Henry
 
 P.S: If you want another delightful one in the same vein (both from 
 WTFy-ness and perf improvement POV), take a gander at UTF16TextConverter >> 
 nextPutByteString:toStream: 
 
 
 On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe s...@stfx.eu wrote:
 Hi Henrik,
 
 Great writeup, thanks !
 
 (more inline)
 
 On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no 
 wrote:
 
  On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
  Working on Petit Delphi we found a strange implementation for 
  asPetitStream:
  Stream>>asPetitStream
   ^ self contents asPetitStream
 
  Further investigation showed that the basic peek was not fast enough for 
  Petit Parser, as it is used a lot. So it implemented an improved 
  unchecked peek:
  PPStream>>peek
   "An improved version of peek, that is slightly faster than the built 
  in version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
  PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to peek 
  over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
  But to my knowledge a basic peek should be fast. The real problem is the 
  peek in the underlying stream:
  PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
  That actually uses self next. The least one should do is to cache 
  the next object. But isn't there a primitive for peek in a file stream? 
  Because all overriding peeks of PositionableStream have basically the same 
  implementation: reading the next and restoring the state to before the 
  peek (that is slow). So we would like to be able to remove PPStream 
  without causing performance issues, as the only added method is the 
  improved peek.
 
  Stephan and Diego
 
  If you are reading from file, ZnCharacterStream should be a valid 
  alternative.
  If not, ZnBufferedReadStream on an internal collection stream also does 
  peek caching.
 
  Beware with files though; it’s better to bench the overall operation for 
  different alternatives.
  F.ex, ZnCharacterStream is much faster than the standard FileStream for 
  peek:
 
  cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference 
  readStream binary.
  [cr peek] bench. '49,400,000 per second.'
  cr close.
 
  FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] 
  '535,000 per second.'
 
  but has different bulk read characteristics (faster for small bulks, 
  slower for large bulks, crossover point at around 1k chars at once);
  (The actual values are of course also dependent on encoder/file contents, 
  those given here obtained with UTF-8 and a mostly/all ASCII text file)
 
  [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference 
  readStream binary ) readStream.
   cr next: 65536; close] bench  '105 per second.'  '106 per second.'
 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
 which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
 next: 65536; close ] bench.
 
 "135 per second." BEFORE
 "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.
 
 Sven
 
  [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
   ] bench  '176 per second.'
 
  If you use a StandardFileStream set to binary (which has less overhead 
  for binary next’s compared to the MultiByteFileStream returned by 
  asFileReference readStream) as the base stream instead, the same general 
  profile holds true, but with a crossover around 2k characters.
 
  TL;DR: Benchmark the alternatives. The best replacement option depends on 
  your results. Appropriately (according to source and actual use) 

Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:20, Henrik Johansen henrik.s.johan...@veloxit.no wrote:

 
 On 04 Nov 2013, at 5:12 , Sven Van Caekenberghe s...@stfx.eu wrote:
 
 
 Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream 
 did not yet make use of the optimisations that I did for ZnCharacterEncoding 
 some time ago. More specifically, they were not yet using 
 #next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
 which are overridden for ZnUTF8Encoder with (super hacked) versions that 
 assume most of the input will be ASCII (a reasonable assumption).
 
 I am still chasing a bug, but right now:
 
 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
 binary))
  next: 65536; close ] bench. 
 
  "135 per second." BEFORE
  "3,310 per second." AFTER
 
 But of course the input file is ASCII, so YMMV.
 
 I’ll let you know when I commit this code.
 
 Sven
 
 Yeah… sooo, I loaded the updated version, great improvement for streams on 
 Latin1 content :D
 
 Maybe it’s just me, but I tested with actual wide source as well (it was as 
 slow as you’d expect), and I think you need a notQuiteSoOptimizedReadInto* 
 which uses the normal Byte -> Wide become: conversion machinery.
 Writing a ZnByteStringBecameWideString handler for every next: (and cousins) 
 call where source may be non-latin1 is a real chore/nasty surprise for those 
 used to dealing with legacy streams/converters... 
 You can sort of kinda make up for the performance hit (at least on these 
 sizes) with a faster replace:from:to:with:startingAt: used after the fact, 
 by using basicAt:put: on the string, thus avoiding converting the replacement 
 value to a Character. 
 
 Cheers,
 Henry
 
 PS: Why is ZnByteStringBecameWideString a notification and not a resumable 
 exception? I would assume those who run into it without a handler would 
 rather have an actual error, than a result where their string has been read 
 up to the first wide char…

Henrik, thanks again for the feedback, it is really useful.

You are right, the ZnByteStringBecameWideString hack was introduced 
specifically to avoid the #become:, because benchmarking showed that it took 
too long. The original user is ZnStringEntity#readFrom: 

Now, if you know up front that your string will be wide anyway, then there is 
#decodeBytesIntoWideString:
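In workspace terms that would be something like the following sketch; #decodeBytesIntoWideString: is the selector Sven names, assumed here to take the encoded bytes and answer a WideString directly:

| bytes |
bytes := ZnUTF8Encoder new encodeString: 'こんにちは'.
"Decode straight into a WideString, skipping the ByteString-to-WideString switch"
ZnUTF8Encoder new decodeBytesIntoWideString: bytes.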

Yes, ZnByteStringBecameWideString should probably be a resumable exception, as 
not handling it properly will lead to errors.

I will have to think about and reconsider the generality/specificity of the 
optimisation hack in ZnUTF8Encoder. I have some ideas, but it will take some 
time.

At least the code works now, time to process the constructive criticism.

Regards,

Sven








Re: [Pharo-dev] Peek

2013-11-06 Thread Sven Van Caekenberghe

On 06 Nov 2013, at 13:29, Henrik Johansen henrik.s.johan...@veloxit.no wrote:

 On 06 Nov 2013, at 10:38 , Sven Van Caekenberghe s...@stfx.eu wrote:
 
 BTW, do we still need UTF16 support ?
 
 For those encodings that we still want to support in the future, we should 
 have new and more principled implementations under ZnCharacterEncoder. That 
 is, if we ever want to phase out TextConverter.
 
 UTF16 is the encoding for string arguments to the Unicode-aware versions of 
 the Windows APIs, so I’d say yes.
 Not too sure about JNI, but I know Java also uses UTF16 internally.
 
 So the short answer is yes.
 Not as a storage encoding, but for calling foreign functions, it’s quite 
 important.

Ah, OK.

Fancy giving it a try, to create a ‘new’ implementation under 
ZnCharacterEncoder with some tests ?

There are only 3 abstract methods to implement; if you already know 
UTF16, it would not be too much work ;-)
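To give an idea of the shape of the work, a decoding skeleton could start like this. This is a sketch: the exact subclassResponsibility selectors in ZnCharacterEncoder may be named differently, and #nextWordFromStream: is a hypothetical helper that reads one 16-bit unit, hiding byte order.

ZnCharacterEncoder subclass: #ZnUTF16Encoder
    instanceVariableNames: ''
    classVariableNames: ''
    category: 'Zinc-Character-Encoding-Core'

nextCodePointFromStream: stream
    "Read one code point, combining a surrogate pair for values above U+FFFF."
    | code low |
    code := self nextWordFromStream: stream.
    ^ (code between: 16rD800 and: 16rDBFF)
        ifTrue: [
            low := self nextWordFromStream: stream.
            16r10000 + ((code - 16rD800) bitShift: 10) + (low - 16rDC00) ]
        ifFalse: [ code ]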


 Cheers,
 Henry
 




Re: [Pharo-dev] Peek

2013-11-04 Thread Sven Van Caekenberghe
Hi Diego,

Streams as they are currently defined, especially ReadStreams, can hardly be 
called IO streams. Many operations assume that there is a single collection 
over which you can move in any direction without restrictions. As a consequence 
most parsing code relies on this ability, looking forward and going back multiple 
times, like using #skip: with a negative argument.

Real IO ReadStreams like SocketStreams or FileStreams cannot be assumed to 
keep more than just a buffer in memory, hence the unrestricted operations are 
not implementable.

IMO, parsing code should be written against a much more restricted API, 
assuming at most a one-element peek buffer. I always try to do that, I even 
have Mock ReadStreams that ensure this.

While benchmarking I have also found that slow #peek behaviour can be a 
bottleneck, especially on some of the more complex streams.

Sven

PS: Have a look at ZnCharacterReadStream for example, which implements this 
restricted behaviour: it is not positionable, but allows a one-character peek 
using a one-character internal buffer.
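The one-element buffer described here amounts to something like the following sketch, with hypothetical variable names (peeked, stream) in the spirit of ZnCharacterReadStream; it treats nil as "no buffered element", which is fine for character streams:

peek
    "Answer the next character without consuming it, reading at most once."
    ^ peeked ifNil: [ peeked := stream next ]

next
    "Answer the buffered character from a previous peek, or read a fresh one."
    ^ peeked
        ifNil: [ stream next ]
        ifNotNil: [ :character | peeked := nil. character ]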

On 04 Nov 2013, at 09:57, Diego Lont diego.l...@delware.nl wrote:

 Working on Petit Delphi we found a strange implementation for asPetitStream:
 Stream>>asPetitStream
   ^ self contents asPetitStream
 
 Further investigation showed that the basic peek was not fast enough for 
 Petit Parser, as it is used a lot. So it implemented an improved unchecked 
 peek:
 PPStream>>peek
   "An improved version of peek, that is slightly faster than the built in 
 version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
 PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to peek 
 over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
 But to my knowledge a basic peek should be fast. The real problem is the peek 
 in the underlying stream:
 PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the 
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
 That actually uses self next. The least one should do is to cache the 
 next object. But isn't there a primitive for peek in a file stream? Because 
 all overriding peeks of PositionableStream have basically the same 
 implementation: reading the next and restoring the state to before the peek 
 (that is slow). So we would like to be able to remove PPStream without 
 causing performance issues, as the only added method is the improved peek.
 
 Stephan and Diego




Re: [Pharo-dev] Peek

2013-11-04 Thread Henrik Johansen

On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:

 Working on Petit Delphi we found a strange implementation for asPetitStream:
 Stream>>asPetitStream
   ^ self contents asPetitStream
 
 Further investigation showed that the basic peek was not fast enough for 
 Petit Parser, as it is used a lot. So it implemented an improved unchecked 
 peek:
 PPStream>>peek
   "An improved version of peek, that is slightly faster than the built in 
 version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
 PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to peek 
 over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
 But to my knowledge a basic peek should be fast. The real problem is the peek 
 in the underlying stream:
 PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the 
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
 That actually uses self next. The least one should do is to cache the 
 next object. But isn't there a primitive for peek in a file stream? Because 
 all overriding peeks of PositionableStream have basically the same 
 implementation: reading the next and restoring the state to before the peek 
 (that is slow). So we would like to be able to remove PPStream without 
 causing performance issues, as the only added method is the improved peek.
 
 Stephan and Diego

If you are reading from file, ZnCharacterStream should be a valid alternative.
If not, ZnBufferedReadStream on an internal collection stream also does peek 
caching.

Beware with files though; it’s better to bench the overall operation for 
different alternatives.
F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:

cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
binary.
[cr peek] bench. '49,400,000 per second.'
cr close.

FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 per 
second.'

but has different bulk read characteristics (faster for small bulks, slower 
for large bulks, crossover point at around 1k chars at once);
(The actual values are of course also dependent on encoder/file contents, those 
given here obtained with UTF-8 and a mostly/all ASCII text file)

[cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference readStream 
binary ) readStream.
cr next: 65536; close] bench  '105 per second.'  '106 per second.'

[FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
] bench  '176 per second.'

If you use a StandardFileStream set to binary (which has less overhead for 
binary next’s compared to the MultiByteFileStream returned by asFileReference 
readStream) as the base stream instead, the same general profile holds true, 
but with a crossover around 2k characters.

TL;DR: Benchmark the alternatives. The best replacement option depends on your 
results. Appropriately (according to source and actual use) set up Zn-streams 
are probably your best bet. 
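For readers wanting to reproduce this, the measurement pattern is simply #bench around the operation under test; for example, along these lines (file name as in the examples above; #readStreamDo: assumed available on FileReference):

| file |
file := 'PharoDebug.log' asFileReference.

"Per-character peek throughput"
file readStreamDo: [ :raw | | cr |
    cr := ZnCharacterReadStream on: raw binary.
    [ cr peek ] bench ].

"Bulk read throughput, 64k characters per call"
[ file readStreamDo: [ :raw |
    (ZnCharacterReadStream on: raw binary) next: 65536 ] ] bench.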

Cheers,
Henry




Re: [Pharo-dev] Peek

2013-11-04 Thread Sven Van Caekenberghe
Hi Henrik,

Great writeup, thanks !

(more inline)

On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no wrote:

 On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
 Working on Petit Delphi we found a strange implementation for asPetitStream:
 Stream>>asPetitStream
  ^ self contents asPetitStream
 
 Further investigation showed that the basic peek was not fast enough for 
 Petit Parser, as it is used a lot. So it implemented an improved unchecked 
 peek:
 PPStream>>peek
  "An improved version of peek, that is slightly faster than the built in 
 version."
  ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
 PPStream>>uncheckedPeek
  "An unchecked version of peek that throws an error if we try to peek 
 over the end of the stream, even faster than #peek."
  ^ collection at: position + 1
 
 But to my knowledge a basic peek should be fast. The real problem is the 
 peek in the underlying stream:
 PositionableStream>>peek
  "Answer what would be returned if the message next were sent to the 
  receiver. If the receiver is at the end, answer nil."
 
  | nextObject |
  self atEnd ifTrue: [^nil].
  nextObject := self next.
  position := position - 1.
  ^nextObject
 
 That actually uses self next. The least one should do is to cache 
 the next object. But isn't there a primitive for peek in a file stream? 
 Because all overriding peeks of PositionableStream have basically the same 
 implementation: reading the next and restoring the state to before the peek 
 (that is slow). So we would like to be able to remove PPStream without 
 causing performance issues, as the only added method is the improved peek.
 
 Stephan and Diego
 
 If you are reading from file, ZnCharacterStream should be a valid alternative.
 If not, ZnBufferedReadStream on an internal collection stream also does peek 
 caching.
 
 Beware with files though; it’s better to bench the overall operation for 
 different alternatives.
 F.ex, ZnCharacterStream is much faster than the standard FileStream for peek:
 
 cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference readStream 
 binary.
 [cr peek] bench. '49,400,000 per second.'
 cr close.
 
 FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench] '535,000 
 per second.'
 
 but has different bulk read characteristics (faster for small bulks, slower 
 for large bulks, crossover point at around 1k chars at once);
 (The actual values are of course also dependent on encoder/file contents, 
 those given here obtained with UTF-8 and a mostly/all ASCII text file)
 
 [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference readStream 
 binary ) readStream.
   cr next: 65536; close] bench  '105 per second.'  '106 per second.'

Well, I just realised that ZnCharacterReadStream and ZnCharacterWriteStream did 
not yet make use of the optimisations that I did for ZnCharacterEncoding some 
time ago. More specifically, they were not yet using 
#next:putAll:startingAt:toStream: and #readInto:startingAt:count:fromStream: 
which are overridden for ZnUTF8Encoder with (super hacked) versions that 
assume most of the input will be ASCII (a reasonable assumption).

I am still chasing a bug, but right now:

[ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream 
binary))
next: 65536; close ] bench. 

"135 per second." BEFORE
"3,310 per second." AFTER

But of course the input file is ASCII, so YMMV.

I’ll let you know when I commit this code.

Sven

 [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
   ] bench  '176 per second.'
 
 If you use a StandardFileStream set to binary (which has less overhead for 
 binary next’s compared to the MultiByteFileStream returned by asFileReference 
 readStream) as the base stream instead, the same general profile holds true, 
 but with a crossover around 2k characters.
 
 TL;DR: Benchmark the alternatives. The best replacement option depends on 
 your results. Appropriately (according to source and actual use) set up 
 Zn-streams are probably your best bet. 
 
 Cheers,
 Henry




Re: [Pharo-dev] Peek

2013-11-04 Thread Henrik Sperre Johansen
That's great!
Remembering that commit message was part of the reason for benching, I was
sort of disappointed there was no significant difference between Zn in 2.0
and latest 3.0...

I guess with the amount of hacks accumulating, it is indeed turning into a
worthy successor of MultiByteFileStream ;)

Cheers,
Henry

P.S: If you want another delightful one in the same vein (both from
WTFy-ness and perf improvement POV), take a gander at UTF16TextConverter >>
nextPutByteString:toStream:


On Mon, Nov 4, 2013 at 5:12 PM, Sven Van Caekenberghe s...@stfx.eu wrote:

 Hi Henrik,

 Great writeup, thanks !

 (more inline)

 On 04 Nov 2013, at 11:58, Henrik Johansen henrik.s.johan...@veloxit.no
 wrote:

  On 04 Nov 2013, at 9:57 , Diego Lont diego.l...@delware.nl wrote:
 
  Working on Petit Delphi we found a strange implementation for
 asPetitStream:
  Stream>>asPetitStream
   ^ self contents asPetitStream
 
  Further investigation showed that the basic peek was not fast enough
 for Petit Parser, as it is used a lot. So it implemented an improved
 unchecked peek:
  PPStream>>peek
   "An improved version of peek, that is slightly faster than the
 built in version."
   ^ self atEnd ifFalse: [ collection at: position + 1 ]
 
  PPStream>>uncheckedPeek
   "An unchecked version of peek that throws an error if we try to
 peek over the end of the stream, even faster than #peek."
   ^ collection at: position + 1
 
  But to my knowledge a basic peek should be fast. The real problem is
 the peek in the underlying stream:
  PositionableStream>>peek
   "Answer what would be returned if the message next were sent to the
   receiver. If the receiver is at the end, answer nil."
 
   | nextObject |
   self atEnd ifTrue: [^nil].
   nextObject := self next.
   position := position - 1.
   ^nextObject
 
  That actually uses self next. The least one should do is to
 cache the next object. But isn't there a primitive for peek in a file
 stream? Because all overriding peeks of PositionableStream have basically
 the same implementation: reading the next and restoring the state to before
 the peek (that is slow). So we would like to be able to remove PPStream
 without causing performance issues, as the only added method is the
 improved peek.
 
  Stephan and Diego
 
  If you are reading from file, ZnCharacterStream should be a valid
 alternative.
  If not, ZnBufferedReadStream on an internal collection stream also does
 peek caching.
 
  Beware with files though; it’s better to bench the overall operation for
 different alternatives.
  F.ex, ZnCharacterStream is much faster than the standard FileStream for
 peek:
 
  cr := ZnCharacterReadStream on: 'PharoDebug.log' asFileReference
 readStream binary.
  [cr peek] bench. '49,400,000 per second.'
  cr close.
 
  FileStream fileNamed: 'PharoDebug.log' do: [:fs | [fs peek] bench]
 '535,000 per second.'
 
  but has different bulk read characteristics (faster for small bulks,
 slower for large bulks, crossover point at around 1k chars at once);
  (The actual values are of course also dependent on encoder/file
 contents, those given here obtained with UTF-8 and a mostly/all ASCII text
 file)
 
  [cr := ZnCharacterReadStream on: ('PharoDebug.log' asFileReference
 readStream binary ) readStream.
   cr next: 65536; close] bench  '105 per second.'  '106 per second.'

 Well, I just realised that ZnCharacterReadStream and
 ZnCharacterWriteStream did not yet make use of the optimisations that I did
 for ZnCharacterEncoding some time ago. More specifically, they were not yet
 using #next:putAll:startingAt:toStream: and
 #readInto:startingAt:count:fromStream: which are overridden for
 ZnUTF8Encoder with (super hacked) versions that assume most of the input
 will be ASCII (a reasonable assumption).

 I am still chasing a bug, but right now:

 [ (ZnCharacterReadStream on: ('timezones.json' asFileReference readStream
 binary))
 next: 65536; close ] bench.

 "135 per second." BEFORE
 "3,310 per second." AFTER

 But of course the input file is ASCII, so YMMV.

 I’ll let you know when I commit this code.

 Sven

  [FileStream fileNamed: 'PharoDebug.log' do: [:fs | fs next: 65536]
   ] bench  '176 per second.'
 
  If you use a StandardFileStream set to binary (which has less overhead
 for binary next’s compared to the MultiByteFileStream returned by
 asFileReference readStream) as the base stream instead, the same general
 profile holds true, but with a crossover around 2k characters.
 
  TL;DR: Benchmark the alternatives. The best replacement option depends
 on your results. Appropriately (according to source and actual use) set up
 Zn-streams are probably your best bet.
 
  Cheers,
  Henry