Re: is type checking in D undecidable?

2020-10-23 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 23 October 2020 at 16:56:46 UTC, Kagamin wrote:
On Thursday, 22 October 2020 at 18:24:47 UTC, Bruce Carneal 
wrote:
Per the wiki on termination analysis some languages with 
dependent types (Agda, Coq) have built-in termination checkers.


What do they do with code that does, say, a hash preimage attack?


Not my area but after a quick wiki skim my guess is that the 
termination checkers would not be helpful at all.


If you do pick up one of the termination checker languages and 
experiment, please post back with the results.





Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 23 October 2020 at 04:24:09 UTC, Paul Backus wrote:

On Friday, 23 October 2020 at 00:53:19 UTC, Bruce Carneal wrote:
When you write functions, the compiler helps you out with 
fully automated constraint checking.  When you write templates 
you can write them so that they look like simple functions, in 
which case you're on pretty solid ground.  Your manual 
constraints will probably work.  Hard to screw up a four line 
eponymous template with constraints.  Hard to screw up a 
"leaf" template with a small number of template args.  
Possible but hard.  Not so hard to screw up 
"wanna-be-as-general-as-possible-but-special-case-performant-and-sometimes-wierdly-recursive-cuz-otherwise-the-compiler-blows-up" templates.


This is true, but it has nothing at all to do with 
decidability--which is a term with a precise technical 
definition in computer science.


Yep.  The thread started with the technical definition, as you'll 
note in the wiki articles that I cited, and then moved on.  I 
probably should have started another thread.




The reason writing correct generic code using templates (or any 
macro system) is so difficult is that templates (and macros in 
general) are, effectively, dynamically typed. Unlike regular 
functions, templates are not type checked when they are 
declared, but when they are "executed" (that is, instantiated). 
In that sense, writing templates in D is very similar to 
writing code in a dynamically-typed language like Python or 
Javascript.
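
A minimal sketch of that point (names are hypothetical): the 
template body is only type checked when instantiated, so errors 
surface at the use site, much as they would at run time in Python.

auto twice(T)(T x)
{
    return x * 2;            // not checked until instantiated
}

void main()
{
    auto a = twice(21);      // fine: int supports *
    // auto b = twice("hi"); // errors only at this instantiation
}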


Yep. Good observations.  Functions offer some nice benefits.  I'd 
like to see their use increase (type functions displacing 
templates wherever appropriate).






Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 22 October 2020 at 20:37:22 UTC, Paul Backus wrote:
On Thursday, 22 October 2020 at 19:24:53 UTC, Bruce Carneal 
wrote:
On a related topic, I believe that type functions enable a 
large amount of code in the "may be hard to prove decidable" 
category (templates) to be (re)written as clearly decidable 
code.  Easier for the compiler to deal with and, more 
importantly, easier to teach.


Type functions, like regular functions, are Turing-complete, 
and therefore undecidable in the general case.


Sure, most languages are Turing complete at run time.  A few are 
Turing complete at compile time but templates are also pattern 
matching code expanders. Polymorphic.


By design templates are open ended and powerful (in both the 
practical and theoretical sense).  In some situations they're 
exactly what you need.  They are also harder to employ correctly 
than functions.


When you write functions, the compiler helps you out with fully 
automated constraint checking.  When you write templates you can 
write them so that they look like simple functions, in which case 
you're on pretty solid ground.  Your manual constraints will 
probably work.  Hard to screw up a four line eponymous template 
with constraints.  Hard to screw up a "leaf" template with a 
small number of template args.  Possible but hard.  Not so hard 
to screw up 
"wanna-be-as-general-as-possible-but-special-case-performant-and-sometimes-wierdly-recursive-cuz-otherwise-the-compiler-blows-up" templates.


It'd be great if we could get rid of many such templates, and, 
even more importantly, avoid writing a lot more.  That is likely 
when we start asking if type functions can reduce bugs long term, 
both those experienced in the currently tortured template 
subsystem and those experienced in user code.  The performance 
gains exhibited by type functions are, to me, just gravy.




Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 22 October 2020 at 18:46:07 UTC, Ola Fosheim Grøstad 
wrote:

On Thursday, 22 October 2020 at 18:38:12 UTC, Stefan Koch wrote:
On Thursday, 22 October 2020 at 18:33:52 UTC, Ola Fosheim 
Grøstad wrote:


In general, it is hard to tell if a computation is 
long-running or unsolvable.


You could even say ... it's undecidable :)


:-) Yes, although ...


[...]

Also, some advanced systems might be able to detect that no 
real progress is possible.  For example, being able to prove 
that the number of "subqueries" to be tried will increase 
faster than the number of "subqueries" that can be resolved.


I don't think it is any easier to prove the "will increase 
faster" proposition than it is to prove the whole thing.




But this is really the frontier of programming language 
design...


Agree. As I see it, D is on the frontier of practical (meta) 
programming.  A very exciting place to be.


On a related topic, I believe that type functions enable a large 
amount of code in the "may be hard to prove decidable" category 
(templates) to be (re)written as clearly decidable code.  Easier 
for the compiler to deal with and, more importantly, easier to 
teach.




Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 22 October 2020 at 18:04:32 UTC, Ola Fosheim Grøstad 
wrote:
On Thursday, 22 October 2020 at 17:25:44 UTC, Bruce Carneal 
wrote:
Is type checking in D undecidable?  Per the wiki on dependent 
types it sure looks like it is.


Even if it is, you can still write something that is decidable 
in D, but impractical in terms of compile time.


Yep.  Within some non-exponential CTFE speed factor, that's 
equivalent to saying your program might run too long.




You probably mean more advanced type systems where these things 
are expressed more implicitly. Basically type systems where you 
can express and resolve properties related to infinite sizes. D 
does not have such capabilities, so you have to go out of your 
way to end up in that territory in D.


Where "out of your way" means what?  Use of templates with 
potentially unbounded recursive expression?  Other?


Per the wiki on termination analysis some languages with 
dependent types (Agda, Coq) have built-in termination checkers.  
I assume that DMD employs something short of such a checker, some 
combination of cycle detection backed up by resource bounds?




is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
Is type checking in D undecidable?  Per the wiki on dependent 
types it sure looks like it is.


I assume that it's well known to the compiler contributors that D 
type checking is undecidable which, among other reasons, is why 
we have things like template recursion limits.
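
For illustration, a sketch of the kind of unbounded template 
recursion those limits guard against (deliberately trivial):

template Forever(ulong n)
{
    enum Forever = Forever!(n + 1);  // no base case
}
// enum x = Forever!0; // uncommenting makes the compiler stop at
//                     // its recursion limit rather than loop forever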


Confirmation of the assumption or refutation would be most 
welcome.




Re: __vector(ubyte[32]) misalignment

2020-08-10 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 10 August 2020 at 13:52:46 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:46 AM, Steven Schveighoffer wrote:

On 8/9/20 8:37 AM, Steven Schveighoffer wrote:

I think this has come up before, there may even be a bug 
report on it.


Found one, I'll see if I can fix the array runtime:

https://issues.dlang.org/show_bug.cgi?id=10826



Bruce, I have a PR to hopefully fix these issues, if you want 
to test against it:


https://github.com/dlang/druntime/pull/3192

-Steve


No biggie, but it looks like there is some duplicate code at the 
end of the __alignPad unittest.






Re: __vector(ubyte[32]) misalignment

2020-08-10 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 10 August 2020 at 13:52:46 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:46 AM, Steven Schveighoffer wrote:

On 8/9/20 8:37 AM, Steven Schveighoffer wrote:

I think this has come up before, there may even be a bug 
report on it.


Found one, I'll see if I can fix the array runtime:

https://issues.dlang.org/show_bug.cgi?id=10826



Bruce, I have a PR to hopefully fix these issues, if you want 
to test against it:


https://github.com/dlang/druntime/pull/3192

-Steve


The "fix issue 10826" reading was interesting.  Thanks for 
pushing this one through.


Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 9 August 2020 at 12:37:06 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:09 AM, Bruce Carneal wrote:

[...]


All blocks in the GC that are more than 16 bytes are aligned by 
32 bytes. You shouldn't have any 16 byte blocks here, because 
each element is 32 bytes long.


However, if your block grows to a page size, the alignment will 
be 16 bytes off (due to the metadata stored at the front of the 
block).


A page size is 4096 bytes. So anything larger than 2048 will 
require a page-sized block or larger.


I would guess that once your array gets longer than 63 
elements, it's always misaligned?


Quality sleuthing Steve.  The program reports misalignment for 
37 of the 100 array lengths in [1..100].




Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 10:02:32 UTC, kinke wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


IIRC, yes when using the GC, as that only guarantees 16-bytes 
alignment. Static arrays on the stack should be aligned just 
fine with LDC.


Yes, it presents as a GC limitation.

Many thanks for your LDC work.





Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 09:58:18 UTC, Johan wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg 
faults which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


Do you have a code example?
And what compiler are you using?

-Johan


At run.dlang.io, recent runs of both dmd and ldc compilations of 
the code below revealed misalignment.


import std;

void main() @safe
{
    alias V = __vector(ubyte[32]); // requires -mcpu=native or other on cmd line

    V[] va;
    size_t misalignments;
    foreach (N; 1 .. 101)
    {
        va.length = N;
        const uptr = cast(ulong) va.ptr;
        misalignments += (uptr % V.alignof) != 0;
    }
    writefln("misaligned %s per cent of the time", misalignments);
}


Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 05:49:23 UTC, user1234 wrote:

On Sunday, 9 August 2020 at 01:56:54 UTC, Bruce Carneal wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:




Manually managing the alignment eliminated the seg faulting.

Additionally, I found that std.experimental.mallocator 
Mallocator.alignment is 16.


So, is the misalignment baked in at this point?


there's AlignedMallocator, which allows overriding the 
"platformAlignment".

To get vector space, Mallocator is indeed a bad choice.


Yep.


Re: __vector(ubyte[32]) misalignment

2020-08-08 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg 
faults which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


Manually managing the alignment eliminated the seg faulting.
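
For reference, a sketch of the over-allocate-and-slice approach 
(not necessarily the exact code I used):

import core.memory : GC;

V[] alignedArray(V)(size_t n) @trusted
{
    // over-allocate, then slice at the first V.alignof boundary;
    // the conservative GC honors the interior pointer
    auto raw = cast(size_t) GC.malloc(n * V.sizeof + V.alignof - 1);
    const aligned = (raw + V.alignof - 1) & ~(V.alignof - 1);
    return (cast(V*) aligned)[0 .. n];
}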

Additionally, I found that std.experimental.mallocator 
Mallocator.alignment is 16.


So, is the misalignment baked in at this point?





__vector(ubyte[32]) misalignment

2020-08-08 Thread Bruce Carneal via Digitalmars-d-learn
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg faults 
which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?




Re: safety and auto vectorization

2020-08-03 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 3 August 2020 at 18:55:36 UTC, Steven Schveighoffer 
wrote:

On 8/2/20 1:31 PM, Bruce Carneal wrote:

import std;

void f0(int[] a, int[] b, int[] dst) @safe {
     dst[] = a[] + b[];
}


[snip of auto-vectorization example]


I was surprised that f0 ran just fine with a.length and 
b.length geq dst.length.  Is that a bug or a feature?




First, I think this is a bug. A regression in fact. As of 2.077 
this works, and before it did not. There is nothing in the spec 
that says the behavior is defined for this case.


Second, it's more than just that. This also runs currently:

void main()
{
    import std.stdio : writeln;

    auto a = [1, 2, 3];
    auto b = [4, 5, 6];
    int[] dst = new int[4]; // note the extra element
    dst[] = a[] + b[];
    writeln(dst[3]);
}

Prior to 2.077, this fails with array length problems.

After that it prints (at the moment): 402653184

If I up the size to 5, it fails with a range violation. I 
strongly suspect some off-by-one errors, but this looks unsafe.


-Steve


Thanks Steve (and Chad).  Summary: underspecified, varying 
behavior across versions, buggy.


Steve, what's the best way for me to report this?  Are spec 
issues lumped in with the other bugzilla reports?









safety and auto vectorization

2020-08-02 Thread Bruce Carneal via Digitalmars-d-learn

import std;

void f0(int[] a, int[] b, int[] dst) @safe {
    dst[] = a[] + b[];
}

void f1(int[] a, int[] b, int[] dst) @trusted {
    const minLen = min(a.length, b.length, dst.length);
    dst[0..minLen] = a[0..minLen] + b[0..minLen];
    assert(dst.length == minLen);
}

I was surprised that f0 ran just fine with a.length and b.length 
geq dst.length.  Is that a bug or a feature?


Assuming it's a feature, are f0 and f1 morally equivalent?  I ask 
because f1 auto-vectorizes in ldc while f0 does not.  Not sure 
why.  As a guess I'd say that the front end doesn't hoist bounds 
checks in f0 or at least doesn't convey the info to the back end 
in a comprehensible fashion.  Non-guesses welcome.







Re: idiomatic output given -preview=nosharedaccess ,

2020-07-01 Thread Bruce Carneal via Digitalmars-d-learn

On Tuesday, 30 June 2020 at 20:43:00 UTC, Bruce Carneal wrote:
On Tuesday, 30 June 2020 at 20:12:59 UTC, Stanislav Blinov 
wrote:
On Tuesday, 30 June 2020 at 20:04:33 UTC, Steven Schveighoffer 
wrote:


The answer is -- update Phobos so it works with 
-nosharedaccess :)


Yeah... and dip1000. And dip1008. And dip... :)


Didn't want to be snippity but, yeah, with "hello world" 
breaking I thought it was time to fix the standard library.  
Thanks for the polite confirmation(s).


Looking at the stdio.d source it appears that a cast on one line 
within a template could give nosharedaccess programs access to 
stdin, stdout, and stderr.
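
Something with this shape (a hypothetical sketch, not the actual 
Phobos edit):

// strip `shared` in one @trusted spot so user code never
// performs a shared access directly
ref T assumeUnshared(T)(ref shared T value) @trusted
{
    return *cast(T*) &value;
}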


A bug/enhancement request was filed.




Re: idiomatic output given -preview=nosharedaccess ,

2020-06-30 Thread Bruce Carneal via Digitalmars-d-learn

On Tuesday, 30 June 2020 at 20:12:59 UTC, Stanislav Blinov wrote:
On Tuesday, 30 June 2020 at 20:04:33 UTC, Steven Schveighoffer 
wrote:


The answer is -- update Phobos so it works with 
-nosharedaccess :)


Yeah... and dip1000. And dip1008. And dip... :)


Didn't want to be snippity but, yeah, with "hello world" breaking 
I thought it was time to fix the standard library.  Thanks for 
the polite confirmation(s).






idiomatic output given -preview=nosharedaccess ,

2020-06-30 Thread Bruce Carneal via Digitalmars-d-learn
Given -preview=nosharedaccess on the command line, "hello world" 
fails to compile (you are referred to core.atomic ...).
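
That is, for this program:

import std.stdio;

void main()
{
    writeln("hello world"); // rejected under -preview=nosharedaccess
}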


What is the idiomatic way to get writeln style output from a 
nosharedaccess program?


Is separate compilation the way to go?



Re: linker aliases to carry dlang attributes for externs

2020-04-12 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 12 April 2020 at 23:14:42 UTC, Bruce Carneal wrote:
Could dlang compilers emit aliases for extern(C) and 
extern(C++) routines that would carry dlang specific 
information?  (@safe, @nogc, nothrow, ...)


I'm thinking two symbols.  The first as per normal C/C++, and 
the second as per normal dlang with a "use API {C, C++, ...}" 
suffix.


ABI, not API.



linker aliases to carry dlang attributes for externs

2020-04-12 Thread Bruce Carneal via Digitalmars-d-learn
Could dlang compilers emit aliases for extern(C) and extern(C++) 
routines that would carry dlang specific information?  (@safe, 
@nogc, nothrow, ...)


I'm thinking two symbols.  The first as per normal C/C++, and the 
second as per normal dlang with a "use API {C, C++, ...}" suffix.
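
For illustration, the kind of declaration in question (a sketch): 
the D attributes below are promises that currently don't survive 
into the mangled C symbol.

extern(C) @safe @nogc nothrow
int sum_abs(const(int)* p, size_t n);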




Re: auto vectorization notes

2020-03-28 Thread Bruce Carneal via Digitalmars-d-learn

On Saturday, 28 March 2020 at 18:01:37 UTC, Crayo List wrote:

On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:

On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:

On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:

[snip]
Explicit SIMD code, ispc or other, isn't as readable or 
composable or vanilla portable but it certainly is performance 
predictable.


This is not true! The idea of ispc is to write portable code 
that will vectorize predictably based on the target CPU. The 
object file/binary is not portable, if that is what you meant.

Also, I find it readable.



There are many waypoints on the readability <==> performance 
axis.  If ispc works for you along that axis, great!


I find SIMT code readability better than SIMD but a little 
worse than auto-vectorizable kernels.  Hugely better 
performance though for less effort than SIMD if your platform 
supports it.


Again, I don't think this is true. Unless I am misunderstanding 
you, SIMT and SIMD are not mutually exclusive, and if you need 
performance then you must use both.  Also, based on the workload 
and processor, SIMD may be much more effective than SIMT.


SIMD might become part of the solution under the hood for a 
number of reasons including: ease of deployment, programmer 
familiarity, PCIe xfer overheads, kernel launch overhead, memory 
subsystem suitability, existing code base issues, ...


SIMT works for me in high throughput situations where it's hard 
to "take a log" on the problem.  SIMD, in auto-vectorizable or 
more explicit form, works in others.


Combinations can be useful but most of the work I've come in 
contact with splits pretty clearly along the memory bandwidth 
divide (SIMT on one side, SIMD/CPU on the other).  Others need a 
plus-up in arithmetic horsepower.  The more extreme the 
requirements, the more attractive SIMT appears. (hence my 
excitement about dcompute possibly expanding the dlang 
performance envelope with much less cognitive load than 
CUDA/OpenCL/SycL/...)


On the readability front, I find per-lane programming, even with 
the current thread-divergence caveats, to be easier to reason 
about wrt correctness and performance predictability than other 
approaches.  Apparently your mileage does vary.


When you have chosen SIMD, whether ispc or other, over SIMT what 
drove the decision?  Performance?  Ease of programming to reach a 
target speed?



















Re: auto vectorization notes

2020-03-28 Thread Bruce Carneal via Digitalmars-d-learn

On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:

On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:

[snip]
(on the downside you have to guard against compiler code-gen 
performance regressions)




auto vectorization is bad because you never know if your code 
will get vectorized next time you make some change to it and 
recompile.

Just use : https://ispc.github.io/


Yes, that's a downside, you have to measure your performance 
sensitive code if you change it *or* change compilers or targets.


Explicit SIMD code, ispc or other, isn't as readable or 
composable or vanilla portable but it certainly is performance 
predictable.


I find SIMT code readability better than SIMD but a little worse 
than auto-vectorizable kernels.  Hugely better performance though 
for less effort than SIMD if your platform supports it.


Is anyone actively using dcompute (SIMT enabler)?  Unless I hear 
bad things I'll try it down the road as neither going back to 
CUDA nor "forward" to the SycL-verse appeals.























auto vectorization notes

2020-03-23 Thread Bruce Carneal via Digitalmars-d-learn
When speeds are equivalent, or very close, I usually prefer auto 
vectorized code to explicit SIMD/__vector code as it's easier to 
read.  (on the downside you have to guard against compiler 
code-gen performance regressions)


One oddity I've noticed is that I sometimes need to use 
pragma(inline, *false*) in order to get ldc to "do the right 
thing". Apparently the compiler sees the costs/benefits 
differently in the standalone context.


More widely known techniques that have gotten people over the 
serial/SIMD hump include:

 1) simplified indexing relationships
 2) known count inner loops (chunkify)
 3) static foreach blocks (manual inlining that the compiler 
"gathers")


I'd be interested to hear from others regarding their auto 
vectorization and __vector experiences.  What has worked and what 
hasn't worked in your performance sensitive dlang code?




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-28 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:

On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
wrote:
So after reading the translation of RYU I was interested too 
see if the decimalLength() function can be written to be 
faster, as it cascades up to 8 CMP.


[...]


bsr can be done in one or two CPU operations, quite quick. But 
core.bitop.bsr wouldn't be inlined. Instead, mir-core 
(mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to 
get code with inlining.


That's surprising.  I just got ldc to inline core.bitop.bsr on 
run.dlang.io using ldc -O3 -mcpu=native. (not sure what the 
target CPU is)


Under what conditions should I be guarding against an inlining 
failure?


Here's the code I used:

int main(string[] args)
{
    import core.bitop : bsr;
    return bsr(cast(uint)args.length);
}

BTW, I'm a huge fan of your performance work.




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-28 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:

On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
So after reading the translation of RYU I was interested too 
see if the decimalLength() function can be written to be 
faster, as it cascades up to 8 CMP.


[...]


bsr can be done in one or two CPU operations, quite quick. But 
core.bitop.bsr wouldn't be inlined. Instead, mir-core 
(mir.bitop: ctlz) or the LDC intrinsic llvm_ctlz can be used to 
get code with inlining.


That's surprising.  I just got ldc to inline core.bitop.bsr on 
run.dlang.io using ldc -O3 -mcpu=native. (not sure what the 
target CPU is)


Under what conditions should I be guarding against an inlining 
failure?








Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 19:46:23 UTC, Basile B. wrote:
Yes please, post the benchmark method. You see, the benchmarks I 
run with your version are always slowest. I'm aware that rndGen 
(and generally any uniform rnd func) is subject to a bias but I 
don't think this bias matters much in the case we talk about.


The code below is the test jig that I'm using currently.  It is 
adapted from yours but adds the -d=distribution command line 
option.


I have omitted the function bodies for the entries in the funcs[] 
table.  You can cut and paste there at will but I wanted to spare 
any other readers the bulk.


The timings I get from the below are entirely consistent with my 
earlier reporting and, I believe, with yours.


I urge you to investigate the difference in timings between the 
original, and currently defaulted, distribution and the -d=edig 
distribution.  If you don't understand what's going on there, or 
if you still don't believe input distributions matter, please get 
in touch.  After all, it's possible that I'm the one with the 
misunderstanding.


Have fun!


// imports needed by the jig:
import core.memory : GC;
import std.getopt;
import std.random;
import std.stdio;

const funcs = [
    &decimalLength9_0, &decimalLength9_1, &decimalLength9_2,
    &decimalLength9_3, &decimalLength9_4, &decimalLength9_5
]; // function bodies omitted in this post, as noted above

ulong runOriginalFuncCount(ubyte funcIdx, ulong count)
{
    GC.disable;

    uint sum;
    foreach (const ulong i; 0 .. count)
    {
        sum += funcs[funcIdx](rndGen.front());
        rndGen.popFront();
    }
    return sum;
}

ulong runFuncCountDist(ubyte funcIdx, ulong count, string dist)
{
    enum perDigit = 1000;
    // NOTE: this is a gross distortion given uint.max but it is "the spec"
    enum maxDigits = 9;
    uint[perDigit * maxDigits] samples = void;
    const chunks = count / samples.length;
    const leftOvers = count % samples.length;

    if (dist == "prng")
    {
        foreach (ref val; samples[])
        {
            val = rndGen.front();
            rndGen.popFront();
        }
    }
    else if (dist == "edig" || dist == "edigsorted")
    {
        // for reference, this is the "6 LOC IIRC"
        uint value = 1;
        foreach (dig; 0 .. maxDigits)
        {
            samples[dig * perDigit .. (dig + 1) * perDigit] = value;
            value *= 10;
        }

        if (auto shuffle = dist != "edigsorted")
            randomShuffle(samples[]);
    }
    else
    {
        assert(0, "dist option should be in {oprng, prng, edig, edigsorted}");
    }
    const func = funcs[funcIdx];

    if (count < 1) // late check gives us a handle on setup time
        return 0;
    // OK, enough with the setup, time to roll
    ulong sum = 0;
    foreach (chunk; 0 .. chunks)
    {
        foreach (sample; samples[])
            sum += func(sample);
    }
    foreach (sample; samples[$ - leftOvers .. $])
        sum += func(sample);
    return sum;
}

// This is a timing test jig for functions that accept a uint
// and return the number of decimal digits needed to represent
// that uint (except that 10 digit values are lumped in with
// 9 digit values currently).
//
// The command line options here are:
//   -c=count (the number of values to present to the function selected)
//   -s=seed (the value supplied to rndGen.seed)
//   -f=funcIndex (identifies the function under test, see source funcs[])
//   -d=distribution (one of: oprng, prng, edig, edigsorted)
//
// Distributions explained:
//   oprng: original prng distribution, a rndGen.popFront per function call
//   prng:  prng distribution, also rndGen but coming from a large,
//          pre-filled array
//   edig:  same number of 1-digit, 2-digit, ... numbers, shuffled
//   edigsorted: edig values but sorted to highlight the branch
//          predictor impact
//
// This test jig could evolve in at least six ways:
//  1) the full digit range could (arguably should) be accounted for
//  2) additional distributions, like "favor small numbers", could be added
//  3) additional implementation ideas can be explored
//  4) the input range could be made variable
//  5) the number base could be templatized
//  6) inlined performance could be tested by aliasing Func
//     (runtime switch to template)
//
// Final note: the main benefit of working on this problem is better
// understanding, not shaving nanoseconds on this particular function.

void main(string[] args)
{
    ulong count;
    uint seed;
    ubyte func;
    // NOTE: oprng writeln output will usually not equal that of -d=prng
    enum defaultDistribution = "oprng";
    string dist = defaultDistribution;

    GC.disable;
    getopt(args, config.passThrough,
           "c|count", &count, "f|func", &func,
           "s|seed", &seed, "d|distribution", &dist);
    rndGen.seed(seed);
    const original = dist == "oprng";
    const sum = original ? runOriginalFuncCount(func, count)
                         : runFuncCountDist(func, count, dist);
    writeln(sum);
}








Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:

On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
I will post my code if there is any meaningful difference in 
your subsequent results.


give me something I can compile and verify. I'm not here to 
steal; if you found something you can still propose it to the 
repos that would take advantage of the optimization.


I'm not at all concerned about theft of trivial code.  I am 
concerned that a simple error in my code will live on in a 
copy/paste environment.


Regardless, I'll post the code once I get home.  It may be the 
only way to communicate the central problem as I see it: 
imprecision in the test specification (the input specification).




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:

big snip
TL;DR for the snipped: Unsurprisingly, different inputs will lead 
to different timing results.  The equi-probable values supplied 
by a standard PRNG differ significantly from an equi-probable 
digit input.  In particular, the PRNG input is much more 
predictable.


Note also that a uint PRNG will produce numbers that need 10 
decimal digits.  If my arithmetic is correct, you'll see "9 or 
better" out of a uint PRNG roughly 97.7% of the time.  The 
predictability of the full set of inputs is, secondarily but 
importantly, also influenced by the ability of the branch 
predictor to ferret out combinations.
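
For reference, a quick check of that fraction (values needing 
nine or more decimal digits):

void main()
{
    import std.stdio : writefln;
    const fraction = (2.0 ^^ 32 - 1e8) / 2.0 ^^ 32;
    writefln("%.2f%%", fraction * 100); // prints 97.67%
}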


I lack confidence in my ability to reason from first principles 
when presented with a modern CPU branch predictor and a complex 
input.  Much easier for me to just run tests against various 
distributions that I think span the input and reason from the 
results.









Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:

On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:


Maybe you talked about another implementation of 
decimalLength9 ?


Yes.  It's one I wrote after I saw your post. Pseudo-code here:


  auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 
100) ... }


Using ldc to target an x86 with the above yields a series of 
cmpl, seta instruction pairs in the function body followed by 
a summing and a return.  No branching.


Let me know if the above is unclear or insufficient.


No thanks, it's crystal clear now.


although I don't see the performance gain.
snip



ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

snip



Could you share your benchmarking method maybe ?


OK. I'm working with equi-probable digits.  You're working with 
equi-probable values.  The use-case may be either of these or, 
more likely, a third as yet unknown/unspecified distribution.


The predictability of the test input sequence clearly influences 
the performance of the branching implementations.  The uniform 
random input is highly predictable wrt digits. (there are *many* 
more high digit numbers than low digit numbers)


Take the implementation above as an example.  If, as in the 
random number (equi-probable value) scenario, your inputs skew 
*stongly* toward a higher number of digits, then the code should 
perform very well.


If you believe the function will see uniform random values as 
input, then testing with uniform random inputs is the way to go.  
If you believe some digits distribution is what will be seen, 
then you should alter your inputs to match.  (testing uniform 
random in addition to the supposed target distribution is almost 
always a good idea for sensitivity analysis).


My testing methodology: To obtain an equi-digit distribution, I 
fill an array with N 1-digit values, followed by N 2-digit 
values, ... followed by N 9-digit values.  I use N == 1000 and 
loop over the total array until I've presented about 1 billion 
values to the function under test.
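
As a sketch, the fill is roughly:

enum N = 1000, maxDigits = 9;

uint[] makeEquiDigit()
{
    auto samples = new uint[N * maxDigits];
    uint value = 1;
    foreach (dig; 0 .. maxDigits)
    {
        samples[dig * N .. (dig + 1) * N] = value; // N per digit count
        value *= 10;
    }
    return samples;
}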


I test against that equi-digit array before shuffling (this 
favors branching implementations) and then again after shuffling 
(this favors branchless).


If you wish to work with equi-digit distributions I'd prefer that 
you implement the functionality anew.  This is some small 
duplicated effort but will help guard against my having screwed 
up even that simple task (less than 6 LOC IIRC).


I will post my code if there is any meaningful difference in your 
subsequent results.












Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:

On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
wrote:


After shuffling the input, branchless wins by 2.4X (240%).




snip

Let me know if the above is unclear or insufficient.


The 2.4X improvement came when using a shuffled uniform digits 
distribution.  Equal numbers of 1 digit values, 2 digit values, 
... put into an array and shuffled.


Branchless *loses* by 8.5% or so on my Zen1 when the array is not 
shuffled (when branch predictions are nearly perfect).













Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn

On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
wrote:


After shuffling the input, branchless wins by 2.4X (240%).


I've replaced the input with the front of a rndGen (popped count 
times, starting with a custom seed) and I never see 
decimalLength9_3 (which seems to be the closest to the original 
in terms of performance) doing better.


The uniform random uint distribution yields a highly skewed 
digits distribution.  Note, for example, that only 10 of the 
2**32 possible outputs from a uint PRNG can be represented by a 
single decimal digit.


Your current winner will be very hard to beat if the inputs are 
uniform random.  That implementation's high-to-low cascade of 
early exit ifs aligns with PRNG output.


If you aren't sure what the input distribution is, or want 
guaranteed good worst case behavior, then branchless may be a 
better way to go.




Maybe you talked about another implementation of decimalLength9 
?


Yes.  It's one I wrote after I saw your post. Pseudo-code here:

  auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) 
... }


Using ldc to target an x86 with the above yields a series of 
cmpl, seta instruction pairs in the function body followed by a 
summing and a return.  No branching.
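
Filled in, the sketch reads (assuming the same 9-digit cap as 
decimalLength9):

ubyte d9_branchless(const uint v)
{
    // each comparison becomes a cmp/seta pair; the bools sum to
    // the digit count with no branches
    return cast(ubyte)(1 + (v >= 10) + (v >= 100) + (v >= 1000)
            + (v >= 10000) + (v >= 100000) + (v >= 1000000)
            + (v >= 10000000) + (v >= 100000000));
}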


Let me know if the above is unclear or insufficient.























Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn
On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal 
wrote:

On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
wrote:

...
foreach (i; 0 .. count)
sum += funcs[func](i);


The input stream is highly predictable and strongly skewed 
towards higher digits.


The winning function implementation lines up with that 
distribution.  It would not fare as well with higher entropy 
input.


Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit 
numbers, ...) decimalLength9_0 beats a simple branchless 
implementation by about 10%.


After shuffling the input, branchless wins by 2.4X (240%).



Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn

On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:

On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
...
foreach (i; 0 .. count)
sum += funcs[func](i);


The input stream is highly predictable and strongly skewed 
towards higher digits.


The winning function implementation lines up with that 
distribution.  It would not fare as well with higher entropy 
input.