Re: maketrans and translate

2012-02-15 Thread bearophile
Jonathan M Davis:

 I'll look into the best way to handle it. Those functions aren't being 
 deprecated this release regardless.

I have added the topic to Bugzilla to avoid its oblivion:
http://d.puremagic.com/issues/show_bug.cgi?id=7515

Bye,
bearophile


Re: maketrans and translate

2012-02-13 Thread Stewart Gordon

On 13/02/2012 03:04, bearophile wrote:

In the online docs I've seen that std.string.maketrans() is (going to be) 
deprecated.
How do you adapt this code to the new regime?

snip

Use an associative array for the translation table.

Or write your own functions that work in the same way as maketrans/translate.

Stewart.


Re: maketrans and translate

2012-02-13 Thread bearophile
Stewart Gordon:

 Use an associative array for the translation table.

A new version of the code:

import std.stdio, std.string, std.range;
void main() {
char[] text = dssdadsdasdas.dup; // lots of MBs of pure 7 bit ASCII text
dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
auto text3 = text.translate(aa, \);
writeln(text3);
}


 Or write your own functions that work in the same way as maketrans/translate.

The code with the associative array is longer, and every char in the original 
string requires one or two associative array lookups. This is quite inefficient 
for ASCII strings. 

Instead of writing my own string functions, or copying them from an older 
Phobos, what about the idea of keeping both the old maketrans and translate 
functions too and to un-deprecate them (maybe moving them in std.ascii)? They 
are not UTF-safe, but I have large amounts of biological data text that is not 
Unicode.

Bye,
bearophile


Re: maketrans and translate

2012-02-13 Thread Jonathan M Davis
On Monday, February 13, 2012 08:09:06 bearophile wrote:
 Stewart Gordon:
  Use an associative array for the translation table.
 
 A new version of the code:
 
 import std.stdio, std.string, std.range;
 void main() {
 char[] text = dssdadsdasdas.dup; // lots of MBs of pure 7 bit ASCII
 text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6',
 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9',
 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7',
 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8',
 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3',
 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2',
 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
 auto text3 = text.translate(aa, \);
 writeln(text3);
 }
 
  Or write your own functions that work in the same way as
  maketrans/translate.
 The code with the associative array is longer, and every char in the
 original string requires one or two associative array lookups. This is
 quite inefficient for ASCII strings.
 
 Instead of writing my own string functions, or copying them from an older
 Phobos, what about the idea of keeping both the old maketrans and translate
 functions too and to un-deprecate them (maybe moving them in std.ascii)?
 They are not UTF-safe, but I have large amounts of biological data text
 that is not Unicode.

Do you have data to backup that there is a significant speed difference? We're 
trying to not have any ASCII-only string stuff.

- Jonathan M Davis


Re: maketrans and translate

2012-02-13 Thread bearophile
Jonathan M Davis:

 Do you have data to backup that there is a significant speed difference?

I have just written a small benchmark, it contains both the URL to the test 
data and the timings I am seeing on a slow PC. If your PC is faster feel free 
to use it on more data (creating a larger input file on disk, concatenating 
many copies of the same test data). If you see bugs in the benchmark please 
tell me.

version3 uses the old translate. version4 uses the new translate. 
version1InPlace is a reference point, it works in-place. version2InPlace is 
similar, uses a linear scan for the chars to remove, and it seems a bit slower 
than version1InPlace (probably because the small deltab array fits well in the 
L1 cache).

If both the benchmark and the timings are correct, the old translate seems 
almost 5 times faster than the new one (and this is not strange, the new 
translate performs two associative array lookups for each char of the input 
string).


import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
immutable t1 = ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ;
immutable t2 = 0111222333445566677788899901112223334455666777888999;
immutable delchars = \;

char[256] transtab;
foreach (i, ref char c; transtab)
c = cast(char)i;

foreach (i, char c; t1)
transtab[c] = t2[i];

bool[256] deltab = false;
foreach (char c; delchars)
deltab[c] = true;

int count;
foreach (char c; text)
if (!deltab[c]) {
text[count] = transtab[c];
count++;
}

return text[0 .. count];
}


char[] version2InPlace(char[] text) {
immutable t1 = ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ;
immutable t2 = 0111222333445566677788899901112223334455666777888999;
immutable delchars = \;

char[256] transtab;
foreach (i, ref char c; transtab)
c = cast(char)i;

foreach (i, char c; t1)
transtab[c] = t2[i];

int count;
outer: foreach (char c; text) {
foreach (char d; delchars)
if (c == d)
continue outer;
text[count] = transtab[c];
count++;
}

return text[0 .. count];
}


string version3(char[] text) {
auto tab = maketrans(ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ,
 
0111222333445566677788899901112223334455666777888999);
return text.translate(tab, \);
}


string version4(string text) {
dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
return text.translate(aa, \);
}


void main() {
// http://patriot.net/~bmcgin/kjv12.txt
auto txt = cast(char[])read(kjv12.txt);

assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
assert(version1InPlace(txt.dup) == version3(txt.dup));
assert(version1InPlace(txt.dup) == version4(txt.idup));

static if (1) version1InPlace(txt); // 0.16 seconds
static if (0) version2InPlace(txt); // 0.18 seconds
static if (0) version3(txt); // 0.20 seconds
static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}


 We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's 
ASCII, it's not Unicode. Like biological data, text data coming out of 
instruments, etc. Handling it as binary data is not good.

Bye,
bearophile


Re: maketrans and translate

2012-02-13 Thread Jonathan M Davis
On Monday, February 13, 2012 14:08:16 bearophile wrote:
 Jonathan M Davis:
  Do you have data to backup that there is a significant speed difference?
 
 I have just written a small benchmark, it contains both the URL to the test
 data and the timings I am seeing on a slow PC. If your PC is faster feel
 free to use it on more data (creating a larger input file on disk,
 concatenating many copies of the same test data). If you see bugs in the
 benchmark please tell me.
 
 version3 uses the old translate. version4 uses the new translate.
 version1InPlace is a reference point, it works in-place. version2InPlace is
 similar, uses a linear scan for the chars to remove, and it seems a bit
 slower than version1InPlace (probably because the small deltab array fits
 well in the L1 cache).
 
 If both the benchmark and the timings are correct, the old translate seems
 almost 5 times faster than the new one (and this is not strange, the new
 translate performs two associative array lookups for each char of the input
 string).

I'll look into the best way to handle it. Those functions aren't being 
deprecated this release regardless.

  We're trying to not have any ASCII-only string stuff.
 
 Around there is a large amount of data that is essentially text, but it's
 ASCII, it's not Unicode. Like biological data, text data coming out of
 instruments, etc. Handling it as binary data is not good.

In most cases, it should make no difference. You could use 1, 2, 3, and 4 as 
easily as you could use A, T, C, and G. Most algorithms don't care one whit 
whether you're dealing with characters or numbers. The only problem that I see 
is cases where there's an algorithm which would normally only be done with 
characters and not numbers and which therefore is in std.string and not 
std.array or std.algorithm - and translate falls into that category.

Regardless, in general, Phobos is not going to have ASCII-specific string 
processing.

- Jonathan M Davis


maketrans and translate

2012-02-12 Thread bearophile
In the online docs I've seen that std.string.maketrans() is (going to be) 
deprecated.
How do you adapt this code to the new regime?


import std.stdio, std.string;
void main() {
char[] text = dssdadsdasdas.dup; // lots of MBs of pure 7 bit ASCII text
auto tab = maketrans(ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ,
 
0111222333445566677788899901112223334455666777888999);
auto text2 = text.translate(tab, \);
writeln(text2);
}

Bye,
bearophile