invariants and compiler flags, best practice?

2021-08-06 Thread Bruce Carneal via Digitalmars-d-learn
I'm nervous enough about future compilations/builds of the code 
that I'm responsible for that I employ the following idiom quite 
a bit, mostly in @trusted code:


  (some boolean expression denoting invariants) || assert(0, 
"what went wrong");


How might the above cause problems and how do you deal with the 
possibility of someone disabling checking of one sort or another? 
 Do you embrace it as late-binding desirable?  Ignore it?  Other?
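
For concreteness, here is a small made-up example of how the idiom typically 
looks in my code (the function and message are illustrative only; IIRC 
assert(0) is lowered to a halt rather than stripped under -release, which is 
part of why I reach for it):

@trusted ubyte grab(const(ubyte)[] buf, size_t i)
{
    (i < buf.length) || assert(0, "grab: index past the end of buf");
    return *(buf.ptr + i);   // the unchecked access the guard protects
}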




Re: Can I get the time "Duration" in "nsecs" accuracy?

2021-07-10 Thread Bruce Carneal via Digitalmars-d-learn
On Saturday, 10 July 2021 at 01:11:28 UTC, Steven Schveighoffer 
wrote:



You can get better than hnsecs resolution with 
`core.time.MonoTime`, which can support whatever the OS 
supports.


However, `Duration` and `SysTime` are stored in hnsecs for a 
very specific reason -- range. Simply put, if you have a 64-bit 
integer, and you picked nanoseconds as the unit, you can store 
only 585 years of range. 10 ns gives you 5850 years, and 100 ns 
gives you 58k years. That should be good enough for all but the 
most esoteric calculations (given that a `Duration` is signed, 
this gives a range of roughly -29k years to 29k years).


Note also that hnsecs is the base unit for Windows high 
precision clocks, though their epoch is year 1600 instead of 
year 0.


Nice summary. hnsecs is a little weird but defensible given the 
range argument.
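
For anyone who wants the raw resolution today, here is a minimal sketch using 
core.time.MonoTime (the printout is just illustrative):

import core.time : MonoTime;
import std.stdio : writefln;

void main()
{
    // the tick rate MonoTime got from the OS; often finer than hnsecs
    writefln("ticks/sec: %s", MonoTime.ticksPerSecond);

    const t0 = MonoTime.currTime;
    // ... the work being timed ...
    const t1 = MonoTime.currTime;

    writefln("elapsed: %s", t1 - t0);                     // a Duration (hnsecs)
    writefln("raw tick delta: %s", t1.ticks - t0.ticks);  // full OS resolution
}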


Down the road we might add a nanosecond timeline abstraction 
based on TAI with zero set to 1972 (when UTC was aligned with TAI 
IIUC).  Range issues could be addressed by reviving the 
long-dormant cent type.  Any precision issues could be handled 
with fixed-point pairs.


Doubles across the same timeline would work for casual 
applications.




Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-03-07 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
wrote:

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)


https://code.dlang.org/packages/intel-intrinsics


I'd try to use it but the platform I'm building on requires AVX 
to get the most performance.


The code below might be worth a try on your AVX512 machine.

Unless you're looking for a combined result, you might need to 
separate out the memory access overhead by running multiple 
passes over a "known optimal for L2" data set.


Also note that I compiled with -preview=in.  I don't know if that 
matters.




import std.math : sqrt;

enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];

pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3,
                  in A b1, in A b2, in A b3)
{
    alias V = __vector(A);

    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V) a;
    }

    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}
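
A hypothetical call site, for anyone trying it (structure-of-arrays layout, 
one static array per coordinate lane; not from the original post):

void example()
{
    A dist, ax, ay, az, bx, by, bz;   // A == float[16] when SIMDBits == 512
    // ... fill ax, ay, az and bx, by, bz with your coordinate lanes ...
    soaEuclidean(dist, ax, ay, az, bx, by, bz);
    // dist[k] now holds the euclidean distance for lane k
}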




Re: Optimizing for SIMD: best practices?(i.e. what features are allowed?)

2021-02-25 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used?(e.g. vmulps, vsqrtps, etc...)
To give some context, this is a sample of one of the functions 
that could benefit from better SIMD usage :

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance;
  a[] -= b[];
  a[] *= a[];
  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
  distance += a[i].abs;//abs required by the caller
  }
  return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret


I've tried to experiment with dynamic arrays of float[3] but 
the output assembly seemed to be worse.[1](in short, it's 
calling internal D functions which use "vxxxss" instructions 
while performing many moves.)


Big thanks
[1] https://run.dlang.io/is/F3Xye3


If you are developing for deployment to a platform that has a 
GPU, you might consider going SIMT (dcompute) rather than SIMD.  
SIMT is a lot easier on the eyes.  More importantly, if you're 
targeting an SoC or console, or have relatively chunky 
computations that allow you to work around the PCIe transit 
costs, the path is open to very large performance improvements.  
I've only been using dcompute for a week or so but so far it's 
been great.


If your algorithms are very branchy, or you decide to stick with 
multi-core/SIMD for any of a number of other good reasons, here 
are a few things I learned before decamping to dcompute land that 
may help:


  1)  LDC is pretty good at auto vectorization as you have 
probably observed.  Definitely worth a few iterations to try and 
get the vectorizer engaged.


  2)  LDC auto vectorization was good but explicit __vector 
programming is more predictable and was, at least for my tasks, 
much faster. I couldn't persuade the auto vectorizer to "do the 
right thing" throughout the hot path but perhaps you'll have 
better luck.


  3)  LDC does a good job of going between T[N] <==> 
__vector(T[N]) so using the static array types as your 
input/output types and the __vector types as your compute types 
works out well whenever you have to interface with an unaligned 
world. LDC issues unaligned vector loads/stores for casts or full 
array assigns: v = cast(VT)sa[];  or v[] = sa[];  These are quite 
good on modern CPUs.  To calibrate, note that Ethan recently 
talked about a 10% gain he experienced using full alignment, 
IIRC, so there's that.


  4) LDC also does a good job of discovering SIMD equivalents 
given static foreach unrolled loops with explicit compile-time 
indexing of vector element operands.  You can use those along 
with pragma(inline, true) to develop your own "intrinsics" that 
supplement other libs.


  5) If you adopt the __vector approach you'll have to handle the 
partials manually. (array length % vec length != 0 indicates a 
partial or tail fragment).  If the classic (copying/padding) 
approaches to such fragmentation don't work for you I'd suggest 
using nested static functions that take ref T[N] inputs and 
outputs.  The main loops become very simple and the tail handling 
reduces to loading stack allocated T[N] variables explicitly, 
calling the static function, and unloading.
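
To make 3) through 5) concrete, here is a rough sketch along those lines 
(assumes LDC with something like -O3 -mcpu=native; the names and the vector 
width are made up for illustration):

import std.algorithm : min;

void addAll(float[] dst, const(float)[] a, const(float)[] b)
{
    enum N = 8;                      // floats per 256-bit vector
    alias SA = float[N];             // static array type at the interfaces
    alias V  = __vector(float[N]);   // vector type for the compute

    // point 4 style "intrinsic": T[N] in/out, unaligned loads/stores via casts
    pragma(inline, true)
    static void addN(ref SA d, in SA x, in SA y)
    {
        V vx = cast(V) x;            // unaligned vector load
        V vy = cast(V) y;
        d = cast(SA)(vx + vy);       // unaligned vector store
    }

    const n = min(dst.length, a.length, b.length);
    size_t i = 0;

    // the main loop works on whole vectors only, so it stays very simple
    for (; i + N <= n; i += N)
        addN(*cast(SA*) &dst[i], *cast(const(SA)*) &a[i], *cast(const(SA)*) &b[i]);

    if (i < n)                       // point 5: tail fragment via stack copies
    {
        SA tx = 0;
        SA ty = 0;
        SA td = void;
        tx[0 .. n - i] = a[i .. n];
        ty[0 .. n - i] = b[i .. n];
        addN(td, tx, ty);
        dst[i .. n] = td[0 .. n - i];
    }
}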


Good luck.




Re: D meets GPU: recommendations?

2021-01-29 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 29 January 2021 at 20:01:17 UTC, Bruce Carneal wrote:
On Friday, 29 January 2021 at 17:46:05 UTC, Guillaume Piolat 
wrote:
On Friday, 29 January 2021 at 16:34:25 UTC, Bruce Carneal 
wrote:
The project I've been working on for the last few months has 
a compute backend that is currently written MT+SIMD.  I would 
like to bring up a GPU variant.


What you could do is resurrect DerelictCL, port it to BindBC, 
and write vanilla OpenCL 1.2 + OpenCL C.

Not up to date on both, but CUDA is messier than OpenCL.

I don't really know about the other possibilities, like OpenGL 
+ compute shaders or Vulkan + compute shaders.



[...]
I've not looked into OpenGL compute shaders seriously either 
but I did look at Vulkan compute shaders.  They looked very 
strong with respect to future deployability but were weak in 
other ways so I kept them off the list.  I don't think I can 
warp my code to fit into glsl or similar, at least not easily. 
Custom neighborhood indexing, group/sub-group scheduling 
control and ability to manage group local memory efficiently 
were the main concerns, IIRC.


I took another look at Vulkan compute shaders.  They have come a 
long way.  Not nearly as nice as programming in D but good enough 
that I'll sketch out a few kernels if I can't make headway with 
dcompute.


Anybody with recent dcompute experience, please speak up.  The 
code in the repository is pretty nice but it looks unfinished or 
at least frozen in time.




Re: D meets GPU: recommendations?

2021-01-29 Thread Bruce Carneal via Digitalmars-d-learn
On Friday, 29 January 2021 at 17:46:05 UTC, Guillaume Piolat 
wrote:

On Friday, 29 January 2021 at 16:34:25 UTC, Bruce Carneal wrote:
The project I've been working on for the last few months has a 
compute backend that is currently written MT+SIMD.  I would 
like to bring up a GPU variant.


What you could do is resurrect DerelictCL, port it to BindBC, 
and write vanilla OpenCL 1.2 + OpenCL C.

Not up to date on both, but CUDA is messier than OpenCL.

I don't really know about the other possibilities, like OpenGL 
+ compute shaders or Vulkan + compute shaders.


Thanks for the pointer to Mike's bindings.  If I run in to 
trouble plugging in with both dcompute and C++/SycL I'll take a 
look before falling back to CUDA.


I've not looked into OpenGL compute shaders seriously either but 
I did look at Vulkan compute shaders.  They looked very strong 
with respect to future deployability but were weak in other ways 
so I kept them off the list.  I don't think I can warp my code to 
fit into glsl or similar, at least not easily. Custom 
neighborhood indexing, group/sub-group scheduling control and 
ability to manage group local memory efficiently were the main 
concerns, IIRC.




Re: D meets GPU: recommendations?

2021-01-29 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 29 January 2021 at 18:23:40 UTC, mw wrote:

On Friday, 29 January 2021 at 16:34:25 UTC, Bruce Carneal wrote:
Guidance from experience regarding any of the above, or other, 
GPU possibilities would be most welcome.




https://dlang.org/blog/2017/10/30/d-compute-running-d-on-the-gpu/


or Google search:

https://www.google.com/search?q=dlang+gpu


Yes, there is a lot of readily available information on the 
possibilities that I listed.  And I have read/viewed much of 
that, including the text and video items prominent in a simple 
google search. (shame on me if I ever ask for something trivially 
available)


Current day "Guidance from *experience*..." is in another 
category, hence the request.








D meets GPU: recommendations?

2021-01-29 Thread Bruce Carneal via Digitalmars-d-learn
The project I've been working on for the last few months has a 
compute backend that is currently written MT+SIMD.  I would like 
to bring up a GPU variant.


If you have experience with this sort of thing, I'd love to hear 
from you, either within this forum or at beerconf.


In a past life I was a CUDA programmer, but I'd prefer to use 
something now that has at least some chance of running on most 
mobile devices in the future.


In addition to the CUDA fallback, I've looked at SycL, dcompute 
and, as suggested by Ethan a couple of beerconfs back, I looked 
into the D/MLIR work.  That work is very nice but I didn't find 
the connection to GPUs.


Guidance from experience regarding any of the above, or other, 
GPU possibilities would be most welcome.




Re: low-latency GC

2020-12-06 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 6 December 2020 at 16:42:00 UTC, Ola Fosheim Grostad 
wrote:

On Sunday, 6 December 2020 at 14:44:25 UTC, Paulo Pinto wrote:
And while on the subject of low level programming in JVM or 
.NET.


https://www.infoq.com/news/2020/12/net-5-runtime-improvements/


Didn't say anything about low level, only SIMD intrinsics, which 
isn't really low level?


It also stated "When it came to something that is pure CPU raw 
computation doing nothing but number crunching, in general, you 
can still eke out better performance if you really focus on 
"pedal to the metal" with your C/C++ code."


So you must make the familiar "ease-of-programming" vs "x% of 
performance" choice, where 'x' is presumably much smaller than 
earlier.




So it is more of a Go contender, and Go is not a systems level 
language... Apples and oranges.




D is good for systems level work but that's not all.  I use it 
for projects where, in the past, I'd have split the work between 
two languages (Python and C/C++).  I much prefer working with a 
single language that spans the problem space.


If there is a way to extend D's reach with zero or a near-zero 
complexity increase as seen by the programmer, I believe we 
should (as/when resources allow of course).




Re: low-latency GC

2020-12-06 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 6 December 2020 at 08:59:49 UTC, Ola Fosheim Grostad 
wrote:

On Sunday, 6 December 2020 at 08:36:49 UTC, Bruce Carneal wrote:
Yes, but they don't allow low level programming. Go also freezes 
to sync threads; this has a rather profound impact on code 
generation. They have spent a lot of effort on sync instructions 
in code gen to lower the latency AFAIK.


So, much of the difficulty in bringing low-latency GC to dlang 
would be the large code gen changes required.  If it is a 
really big effort then that is all we need to know.  Not worth 
it until we can see a big payoff and have more resources.


Well, you could in theory avoid putting owning pointers on the 
stack/globals or require that they are registered as gc roots. 
Then you don't have to scan the stack. All you need then is 
write barriers. IIRC


'shared' with teeth?




Re: low-latency GC

2020-12-06 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 6 December 2020 at 08:12:58 UTC, Ola Fosheim Grostad 
wrote:

On Sunday, 6 December 2020 at 07:45:17 UTC, Bruce Carneal wrote:
GCs scan memory, sure.  Lots of variations.  Not germane.  Not 
a rationale.


We need to freeze the threads when collecting stacks/globals.

OK.  Low latency GCs exist.



D is employed at multiple "levels".  Whatever level you call 
it, Go and modern JVMs employ low latency GCs in 
multi-threaded environments.  Some people would like to use D 
at that "level".


Yes, but they don't allow low level programming. Go also freezes 
to sync threads; this has a rather profound impact on code 
generation. They have spent a lot of effort on sync instructions 
in code gen to lower the latency AFAIK.


So, much of the difficulty in bringing low-latency GC to dlang 
would be the large code gen changes required.  If it is a really 
big effort then that is all we need to know.  Not worth it until 
we can see a big payoff and have more resources.




My question remains: how difficult would it be to bring such 
technology to D as a GC option?  Is it precluded somehow by 
the language?   Is it doable but quite a lot of effort because 
...?
 Is it no big deal once you have the GC itself because you 
only need xyz hooks? Is it ...?


Get rid of the system stack and globals. Use only closures and 
put in a restrictive memory model. Then maybe you can get a 
fully no freeze multi threaded GC.  That would be a different 
language.


It would be, but I don't think it is the only way to get lower 
latency GC.  That said, if the code gen effort you mentioned 
earlier is a big deal, then no need to speculate/examine further.




Also, I think Walter may have been concerned about read 
barrier overhead but, again, I'm looking for feasibility 
information.  What would it take to get something that we 
could compare?


Just add ARC + single threaded GC. And even that is quite 
expensive.


Thanks for the feedback.



Re: low-latency GC

2020-12-05 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 6 December 2020 at 06:52:41 UTC, Ola Fosheim Grostad 
wrote:

On Sunday, 6 December 2020 at 05:41:05 UTC, Bruce Carneal wrote:
OK.  Some rationale?  Do you, for example, believe that 
no-probable-dlanger could benefit from a low-latency GC?  That 
it is too hard to implement?  That the language is somehow 
incompatible? That ...


The GC needs to scan all the affected call stacks before it can 
do incremental collection. Multi threaded GC is generally not 
compatible with low level programming.


GCs scan memory, sure.  Lots of variations.  Not germane.  Not a 
rationale.


D is employed at multiple "levels".  Whatever level you call it, 
Go and modern JVMs employ low latency GCs in multi-threaded 
environments.  Some people would like to use D at that "level".


My question remains: how difficult would it be to bring such 
technology to D as a GC option?  Is it precluded somehow by the 
language?   Is it doable but quite a lot of effort because ...?  
Is it no big deal once you have the GC itself because you only 
need xyz hooks? Is it ...?


Also, I think Walter may have been concerned about read barrier 
overhead but, again, I'm looking for feasibility information.  
What would it take to get something that we could compare?




Re: low-latency GC

2020-12-05 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 6 December 2020 at 05:29:37 UTC, Ola Fosheim Grostad 
wrote:

On Sunday, 6 December 2020 at 05:16:26 UTC, Bruce Carneal wrote:
How difficult would it be to add a selectable, low-latency GC 
to dlang?


Is it closer to "we can't get there from here" or "no big deal 
if you already have the low-latency GC in hand"?


I've heard Walter mention performance issues (write barriers 
IIRC).  I'm also interested in the GC-flavor performance trade 
offs but here I'm just asking about feasibility.


The only reasonable option for D is single threaded GC or ARC.


OK.  Some rationale?  Do you, for example, believe that 
no-probable-dlanger could benefit from a low-latency GC?  That it 
is too hard to implement?  That the language is somehow 
incompatible? That ...




low-latency GC

2020-12-05 Thread Bruce Carneal via Digitalmars-d-learn
How difficult would it be to add a selectable, low-latency GC to 
dlang?


Is it closer to "we can't get there from here" or "no big deal if 
you already have the low-latency GC in hand"?


I've heard Walter mention performance issues (write barriers 
IIRC).  I'm also interested in the GC-flavor performance trade 
offs but here I'm just asking about feasibility.




Re: is type checking in D undecidable?

2020-10-23 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 23 October 2020 at 16:56:46 UTC, Kagamin wrote:
On Thursday, 22 October 2020 at 18:24:47 UTC, Bruce Carneal 
wrote:
Per the wiki on termination analysis some languages with 
dependent types (Agda, Coq) have built-in termination checkers.


What do they do with code that does, say, a hash preimage attack?


Not my area but after a quick wiki skim my guess is that the 
termination checkers would not be helpful at all.


If you do pick up one of the termination checker languages and 
experiment, please post back with the results.





Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 23 October 2020 at 04:24:09 UTC, Paul Backus wrote:

On Friday, 23 October 2020 at 00:53:19 UTC, Bruce Carneal wrote:
When you write functions, the compiler helps you out with 
fully automated constraint checking.  When you write templates 
you can write them so that they look like simple functions, in 
which case you're on pretty solid ground.  Your manual 
constraints will probably work.  Hard to screw up a four line 
eponymous template with constraints.  Hard to screw up a 
"leaf" template with a small number of template args.  
Possible but hard.  Not so hard to screw up 
"wanna-be-as-general-as-possible-but-special-case-performant-and-sometimes-wierdly-recursive-cuz-otherwise-the-compiler-blows-up" templates.


This is true, but it has nothing at all to do with 
decidability--which is a term with a precise technical 
definition in computer science.


Yep.  The thread started with the technical definition, as you'll 
note in the wiki articles that I cited, and then moved on.  I 
probably should have started another thread.




The reason writing correct generic code using templates (or any 
macro system) is so difficult is that templates (and macros in 
general) are, effectively, dynamically typed. Unlike regular 
functions, templates are not type checked when they are 
declared, but when they are "executed" (that is, instantiated). 
In that sense, writing templates in D is very similar to 
writing code in a dynamically-typed language like Python or 
Javascript.


Yep. Good observations.  Functions offer some nice benefits.  I'd 
like to see their use increase (type functions displacing 
templates wherever appropriate).






Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 22 October 2020 at 20:37:22 UTC, Paul Backus wrote:
On Thursday, 22 October 2020 at 19:24:53 UTC, Bruce Carneal 
wrote:
On a related topic, I believe that type functions enable a 
large amount of code in the "may be hard to prove decidable" 
category (templates) to be (re)written as clearly decidable 
code.  Easier for the compiler to deal with and, more 
importantly, easier to teach.


Type functions, like regular functions, are Turing-complete, 
and therefore undecidable in the general case.


Sure, most languages are Turing complete at run time.  A few are 
Turing complete at compile time but templates are also pattern 
matching code expanders. Polymorphic.


By design templates are open ended and powerful (in both the 
practical and theoretical sense).  In some situations they're 
exactly what you need.  They are also harder to employ correctly 
than functions.


When you write functions, the compiler helps you out with fully 
automated constraint checking.  When you write templates you can 
write them so that they look like simple functions, in which case 
you're on pretty solid ground.  Your manual constraints will 
probably work.  Hard to screw up a four line eponymous template 
with constraints.  Hard to screw up a "leaf" template with a 
small number of template args.  Possible but hard.  Not so hard 
to screw up 
"wanna-be-as-general-as-possible-but-special-case-performant-and-sometimes-wierdly-recursive-cuz-otherwise-the-compiler-blows-up" templates.


It'd be great if we could get rid of many such templates, and, 
even more importantly, avoid writing a lot more.  That is likely 
when we start asking if type functions can reduce bugs long term, 
both those experienced in the currently tortured template 
subsystem and those experienced in user code.  The performance 
gains exhibited by type functions are, to me, just gravy.




Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 22 October 2020 at 18:46:07 UTC, Ola Fosheim Grøstad 
wrote:

On Thursday, 22 October 2020 at 18:38:12 UTC, Stefan Koch wrote:
On Thursday, 22 October 2020 at 18:33:52 UTC, Ola Fosheim 
Grøstad wrote:


In general, it is hard to tell if a computation is 
long-running or unsolvable.


You could even say ... it's undecidable :)


:-) Yes, although ...


[...]

Also, some advanced systems might be able to detect that no 
real progress is possible. For example being able to prove that 
the number of "subqueries" to be tried will increase faster 
than the number of "subqueries" that can be resolved.


I don't think it is any easier to prove the "will increase faster" 
proposition than it is to prove the whole thing.




But this is really the frontier of programming language 
design...


Agree. As I see it, D is on the frontier of practical (meta) 
programming.  A very exciting place to be.


On a related topic, I believe that type functions enable a large 
amount of code in the "may be hard to prove decidable" category 
(templates) to be (re)written as clearly decidable code.  Easier 
for the compiler to deal with and, more importantly, easier to 
teach.




Re: is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 22 October 2020 at 18:04:32 UTC, Ola Fosheim Grøstad 
wrote:
On Thursday, 22 October 2020 at 17:25:44 UTC, Bruce Carneal 
wrote:
Is type checking in D undecidable?  Per the wiki on dependent 
types it sure looks like it is.


Even if it is, you can still write something that is decidable 
in D, but impractical in terms of compile time.


Yep.  Within some non-exponential CTFE speed factor that's 
equivalent to saying your program might run too long.




You probably mean more advanced type systems where these things 
are expressed more implicitly. Basically type systems where you 
can express and resolve properties related to infinite sizes. D 
does not have such capabilities, so you have to go out of your 
way to end up in that territory in D.


Where "out of your way" means what?  Use of templates with 
potentially unbounded recursive expression?  Other?


Per the wiki on termination analysis some languages with 
dependent types (Agda, Coq) have built-in termination checkers.  
I assume that DMD employs something short of such a checker, some 
combination of cycle detection backed up by resource bounds?




is type checking in D undecidable?

2020-10-22 Thread Bruce Carneal via Digitalmars-d-learn
Is type checking in D undecidable?  Per the wiki on dependent 
types it sure looks like it is.


I assume that it's well known to the compiler contributors that D 
type checking is undecidable which, among other reasons, is why 
we have things like template recursion limits.


Confirmation of the assumption or refutation would be most 
welcome.




Re: __vector(ubyte[32]) misalignment

2020-08-10 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 10 August 2020 at 13:52:46 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:46 AM, Steven Schveighoffer wrote:

On 8/9/20 8:37 AM, Steven Schveighoffer wrote:

I think this has come up before, there may even be a bug 
report on it.


Found one, I'll see if I can fix the array runtime:

https://issues.dlang.org/show_bug.cgi?id=10826



Bruce, I have a PR to hopefully fix these issues, if you want 
to test against it:


https://github.com/dlang/druntime/pull/3192

-Steve


No biggee but it looks like there is some duplicate code at the 
end of the __alignPad unittest.






Re: __vector(ubyte[32]) misalignment

2020-08-10 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 10 August 2020 at 13:52:46 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:46 AM, Steven Schveighoffer wrote:

On 8/9/20 8:37 AM, Steven Schveighoffer wrote:

I think this has come up before, there may even be a bug 
report on it.


Found one, I'll see if I can fix the array runtime:

https://issues.dlang.org/show_bug.cgi?id=10826



Bruce, I have a PR to hopefully fix these issues, if you want 
to test against it:


https://github.com/dlang/druntime/pull/3192

-Steve


The "fix issue 10826" reading was interesting.  Thanks for 
pushing this one through.


Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn
On Sunday, 9 August 2020 at 12:37:06 UTC, Steven Schveighoffer 
wrote:

On 8/9/20 8:09 AM, Bruce Carneal wrote:

[...]


All blocks in the GC that are more than 16 bytes are aligned by 
32 bytes. You shouldn't have any 16 byte blocks here, because 
each element is 32 bytes long.


However, if your block grows to a page size, the alignment will 
be 16 bytes off (due to the metadata stored at the front of the 
block).


A page size is 4096 bytes. So anything larger than 2048 will 
require a page-sized block or larger.


I would guess that once your array gets longer than 63 
elements, it's always misaligned?


Quality sleuthing Steve. The program says it found misalignments 
with 37 out of 100 attempts of length [1..100].




Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 10:02:32 UTC, kinke wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


IIRC, yes when using the GC, as that only guarantees 16-byte 
alignment. Static arrays on the stack should be aligned just 
fine with LDC.


Yes, it presents as a GC limitation.

Many thanks for your LDC work.





Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 09:58:18 UTC, Johan wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg 
faults which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


Do you have a code example?
And what compiler are you using?

-Johan


At run.dlang.io recent runs of both dmd and ldc compilations of 
the below revealed misalignment.


import std;

void main() @safe
{
    alias V = __vector(ubyte[32]); // requires -mcpu=native or other on cmd line

    V[] va;
    size_t misalignments;
    foreach (N; 1 .. 101)
    {
        va.length = N;
        const uptr = cast(ulong) va.ptr;
        misalignments += (uptr % V.alignof) != 0;
    }
    writefln("misaligned %s per cent of the time", misalignments);
}


Re: __vector(ubyte[32]) misalignment

2020-08-09 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 05:49:23 UTC, user1234 wrote:

On Sunday, 9 August 2020 at 01:56:54 UTC, Bruce Carneal wrote:

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:




Manually managing the alignment eliminated the seg faulting.

Additionally, I found that std.experimental.mallocator 
Mallocator.alignment is 16.


So, is the misalignment baked in at this point?


there's AlignedMallocator that allows one to override the 
"platformAlignment".

To get vector space, Mallocator is indeed a bad choice.


Yep.
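
For anyone landing here later, a rough sketch of that route (using 
std.experimental.allocator's AlignedMallocator; the helper name is made up):

import std.experimental.allocator.mallocator : AlignedMallocator;

alias V = __vector(ubyte[32]);

// caller owns the memory; release it via AlignedMallocator.instance.deallocate
V[] allocAligned(size_t n) @trusted
{
    auto raw = AlignedMallocator.instance.alignedAllocate(n * V.sizeof, cast(uint) V.alignof);
    return raw is null ? null : cast(V[]) raw;
}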


Re: __vector(ubyte[32]) misalignment

2020-08-08 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 9 August 2020 at 01:03:51 UTC, Bruce Carneal wrote:
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg 
faults which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?


Manually managing the alignment eliminated the seg faulting.

Additionally, I found that std.experimental.mallocator 
Mallocator.alignment is 16.


So, is the misalignment baked in at this point?





__vector(ubyte[32]) misalignment

2020-08-08 Thread Bruce Carneal via Digitalmars-d-learn
The .alignof attribute of __vector(ubyte[32]) is 32 but 
initializing an array of such vectors via an assignment to 
.length has given me 16 byte alignment (and subsequent seg faults 
which I suspect are related).


Is sub .alignof alignment expected here?  IOW, do I have to 
manually manage memory if I want alignments above 16?




Re: safety and auto vectorization

2020-08-03 Thread Bruce Carneal via Digitalmars-d-learn
On Monday, 3 August 2020 at 18:55:36 UTC, Steven Schveighoffer 
wrote:

On 8/2/20 1:31 PM, Bruce Carneal wrote:

import std;

void f0(int[] a, int[] b, int[] dst) @safe {
     dst[] = a[] + b[];
}


[snip of auto-vectorization example]


I was surprised that f0 ran just fine with a.length and 
b.length geq dst.length.  Is that a bug or a feature?




First, I think this is a bug. A regression in fact. As of 2.077 
this works, and before it did not. There is nothing in the spec 
that says the behavior is defined for this case.


Second, it's more than just that. This also runs currently:

import std.stdio : writeln;

void main()
{
    auto a = [1, 2, 3];
    auto b = [4, 5, 6];
    int[] dst = new int[4]; // note the extra element
    dst[] = a[] + b[];
    writeln(dst[3]);
}

Prior to 2.077, this fails with array length problems.

After that it prints (at the moment): 402653184

If I up the size to 5, it fails with a range violation. I 
strongly suspect some off-by-one errors, but this looks unsafe.


-Steve


Thanks Steve (and Chad).  Summary: underspecified, varying 
behavior across versions, buggy.


Steve, what's the best way for me to report this?  Are spec 
issues lumped in with the other bugzilla reports?









safety and auto vectorization

2020-08-02 Thread Bruce Carneal via Digitalmars-d-learn

import std;

void f0(int[] a, int[] b, int[] dst) @safe {
    dst[] = a[] + b[];
}

void f1(int[] a, int[] b, int[] dst) @trusted {
    const minLen = min(a.length, b.length, dst.length);
    dst[0..minLen] = a[0..minLen] + b[0..minLen];
    assert(dst.length == minLen);
}

I was surprised that f0 ran just fine with a.length and b.length 
geq dst.length.  Is that a bug or a feature?


Assuming it's a feature, are f0 and f1 morally equivalent?  I ask 
because f1 auto-vectorizes in ldc while f0 does not.  Not sure 
why.  As a guess I'd say that the front end doesn't hoist bounds 
checks in f0 or at least doesn't convey the info to the back end 
in a comprehensible fashion.  Non-guesses welcome.







Re: idiomatic output given -preview=nosharedaccess ,

2020-07-01 Thread Bruce Carneal via Digitalmars-d-learn

On Tuesday, 30 June 2020 at 20:43:00 UTC, Bruce Carneal wrote:
On Tuesday, 30 June 2020 at 20:12:59 UTC, Stanislav Blinov 
wrote:
On Tuesday, 30 June 2020 at 20:04:33 UTC, Steven Schveighoffer 
wrote:


The answer is -- update Phobos so it works with 
-nosharedaccess :)


Yeah... and dip1000. And dip1008. And dip... :)


Didn't want to be snippity but, yeah, with "hello world" 
breaking I thought it was time to fix the standard library.  
Thanks for the polite confirmation(s).


Looking at the stdio.d source it appears that a cast on one line 
within a template could give nosharedaccess programs access to 
stdin, stdout, and stderr.


A bug/enhancement request was filed.




Re: idiomatic output given -preview=nosharedaccess ,

2020-06-30 Thread Bruce Carneal via Digitalmars-d-learn

On Tuesday, 30 June 2020 at 20:12:59 UTC, Stanislav Blinov wrote:
On Tuesday, 30 June 2020 at 20:04:33 UTC, Steven Schveighoffer 
wrote:


The answer is -- update Phobos so it works with 
-nosharedaccess :)


Yeah... and dip1000. And dip1008. And dip... :)


Didn't want to be snippity but, yeah, with "hello world" breaking 
I thought it was time to fix the standard library.  Thanks for 
the polite confirmation(s).






idiomatic output given -preview=nosharedaccess ,

2020-06-30 Thread Bruce Carneal via Digitalmars-d-learn
Given -preview=nosharedaccess on the command line, "hello world" 
fails to compile (you are referred to core.atomic ...).


What is the idiomatic way to get writeln style output from a 
nosharedaccess program?


Is separate compilation the way to go?



Re: linker aliases to carry dlang attributes for externs

2020-04-12 Thread Bruce Carneal via Digitalmars-d-learn

On Sunday, 12 April 2020 at 23:14:42 UTC, Bruce Carneal wrote:
Could dlang compilers emit aliases for extern(C) and 
extern(C++) routines that would carry dlang specific 
information?  (@safe, @nogc, nothrow, ...)


I'm thinking two symbols.  The first as per normal C/C++, and 
the second as per normal dlang with a "use API {C, C++, ...}" 
suffix.


ABI, not API.



linker aliases to carry dlang attributes for externs

2020-04-12 Thread Bruce Carneal via Digitalmars-d-learn
Could dlang compilers emit aliases for extern(C) and extern(C++) 
routines that would carry dlang specific information?  (@safe, 
@nogc, nothrow, ...)


I'm thinking two symbols.  The first as per normal C/C++, and the 
second as per normal dlang with a "use API {C, C++, ...}" suffix.




Re: auto vectorization notes

2020-03-28 Thread Bruce Carneal via Digitalmars-d-learn

On Saturday, 28 March 2020 at 18:01:37 UTC, Crayo List wrote:

On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:

On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:

On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:

[snip]
Explicit SIMD code, ispc or other, isn't as readable or 
composable or vanilla portable but it certainly is performance 
predictable.


This is not true! The idea of ispc is to write portable code 
that will
vectorize predictably based on the target CPU. The object 
file/binary is not portable,

if that is what you meant.
Also, I find it readable.



There are many waypoints on the readability <==> performance 
axis.  If ispc works for you along that axis, great!


I find SIMT code readability better than SIMD but a little 
worse than auto-vectorizable kernels.  Hugely better 
performance though for less effort than SIMD if your platform 
supports it.


Again I don't think this is true. Unless I am misunderstanding 
you, SIMT and SIMD
are not mutually exclusive and if you need performance then you 
must use both.
Also based on the workload and processor SIMD may be much more 
effective than SIMT.


SIMD might become part of the solution under the hood for a 
number of reasons including: ease of deployment, programmer 
familiarity, PCIe xfer overheads, kernel launch overhead, memory 
subsystem suitability, existing code base issues, ...


SIMT works for me in high throughput situations where it's hard 
to "take a log" on the problem.  SIMD, in auto-vectorizable or 
more explicit form, works in others.


Combinations can be useful but most of the work I've come in 
contact with splits pretty clearly along the memory bandwidth 
divide (SIMT on one side, SIMD/CPU on the other).  Others need a 
plus-up in arithmetic horsepower.  The more extreme the 
requirements, the more attractive SIMT appears. (hence my 
excitement about dcompute possibly expanding the dlang 
performance envelope with much less cognitive load than 
CUDA/OpenCL/SycL/...)


On the readability front, I find per-lane programming, even with 
the current thread-divergence caveats, to be easier to reason 
about wrt correctness and performance predictability than other 
approaches.  Apparently your mileage does vary.


When you have chosen SIMD, whether ispc or other, over SIMT what 
drove the decision?  Performance?  Ease of programming to reach a 
target speed?



Re: auto vectorization notes

2020-03-28 Thread Bruce Carneal via Digitalmars-d-learn

On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:

On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:

[snip]
(on the downside you have to guard against compiler code-gen 
performance regressions)




auto vectorization is bad because you never know if your code 
will get vectorized next time you make some change to it and 
recompile.

Just use : https://ispc.github.io/


Yes, that's a downside, you have to measure your performance 
sensitive code if you change it *or* change compilers or targets.


Explicit SIMD code, ispc or other, isn't as readable or 
composable or vanilla portable but it certainly is performance 
predictable.


I find SIMT code readability better than SIMD but a little worse 
than auto-vectorizable kernels.  Hugely better performance though 
for less effort than SIMD if your platform supports it.


Is anyone actively using dcompute (SIMT enabler)?  Unless I hear 
bad things I'll try it down the road as neither going back to 
CUDA nor "forward" to the SycL-verse appeals.





auto vectorization notes

2020-03-23 Thread Bruce Carneal via Digitalmars-d-learn
When speeds are equivalent, or very close, I usually prefer auto 
vectorized code to explicit SIMD/__vector code as it's easier to 
read.  (on the downside you have to guard against compiler 
code-gen performance regressions)


One oddity I've noticed is that I sometimes need to use 
pragma(inline, *false*) in order to get ldc to "do the right 
thing". Apparently the compiler sees the costs/benefits 
differently in the standalone context.


More widely known techniques that have gotten people over the 
serial/SIMD hump include:

 1) simplified indexing relationships
 2) known count inner loops (chunkify)
 3) static foreach blocks (manual inlining that the compiler 
"gathers")

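To illustrate 2) and 3), a rough sketch (the chunk width and names are made 
up; the tail is handled with a plain scalar loop):

import std.math : sqrt;

void rsqrtAll(float[] dst, const(float)[] src)
{
    enum C = 8;                          // known-count inner loop (chunk) width
    const n = dst.length < src.length ? dst.length : src.length;
    size_t i = 0;
    for (; i + C <= n; i += C)
    {
        static foreach (j; 0 .. C)       // compile-time indexed, unrolled body
            dst[i + j] = 1.0f / sqrt(src[i + j]);
    }
    for (; i < n; ++i)                   // scalar tail
        dst[i] = 1.0f / sqrt(src[i]);
}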

I'd be interested to hear from others regarding their auto 
vectorization and __vector experiences.  What has worked and what 
hasn't worked in your performance sensitive dlang code?




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-28 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 28 February 2020 at 10:11:23 UTC, Bruce Carneal wrote:

On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
wrote:
So after reading the translation of RYU I was interested too 
see if the decimalLength() function can be written to be 
faster, as it cascades up to 8 CMP.


[...]


bsr can be done in one/two CPU operation, quite quick. But 
core.bitop.bsr wouldn't be inlined. Instead, mir-core 
(mir.bitop: ctlz) or LDC intrinsics llvm_ctlz can be used for 
to get code with inlining.


That's surprising.  I just got ldc to inline core.bitop.bsr on 
run.dlang.io using ldc -O3 -mcpu=native. (not sure what the 
target CPU is)


Under what conditions should I be guarding against an inlining 
failure?


Here's the code I used:

int main(string[] args)
{
import core.bitop : bsr;
return bsr(cast(uint)args.length);
}

BTW, I'm a huge fan of your performance work.




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-28 Thread Bruce Carneal via Digitalmars-d-learn

On Friday, 28 February 2020 at 06:50:55 UTC, 9il wrote:

On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
So after reading the translation of RYU I was interested too 
see if the decimalLength() function can be written to be 
faster, as it cascades up to 8 CMP.


[...]


bsr can be done in one/two CPU operation, quite quick. But 
core.bitop.bsr wouldn't be inlined. Instead, mir-core 
(mir.bitop: ctlz) or LDC intrinsics llvm_ctlz can be used for 
to get code with inlining.


That's surprising.  I just got ldc to inline core.bitop.bsr on 
run.dlang.io using ldc -O3 -mcpu=native. (not sure what the 
target CPU is)


Under what conditions should I be guarding against an inlining 
failure?








Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 19:46:23 UTC, Basile B. wrote:
Yes please, post the benchmark method. You see the benchmarks I 
run with your version are always slowest. I'm aware that rndGen 
(and generaly any uniform rnd func) is subject to a bias but I 
dont thing this bias maters much in the case we talk about.


The code below is the test jig that I'm using currently.  It is 
adopted from yours but has added the -d=distribution command line 
option.


I have omitted the function bodies for the entries in the funcs[] 
table.  You can cut and paste there at will but I wanted to spare 
any other readers the bulk.


The timings I get from the below are entirely consistent with my 
earlier reporting and, I believe, with yours.


I urge you to investigate the difference in timings between the 
original, and currently defaulted, distribution and the -d=edig 
distribution.  If you don't understand what's going on there, or 
if you still don't believe input distributions matter, please get 
in touch.  After all, it's possible that I'm the one with the 
misunderstanding.


Have fun!


// imports assumed by this jig (the funcs[] bodies themselves are omitted, as noted above)
import std;
import core.memory : GC;

const funcs = [
    _0, _1, _2,
    _3, _4, _5
];

ulong runOriginalFuncCount(ubyte funcIdx, ulong count)
{
GC.disable;

uint sum;
foreach (const ulong i; 0 .. count)
{
sum += funcs[funcIdx](rndGen.front());
rndGen.popFront();
}
return sum;
}

ulong runFuncCountDist(ubyte funcIdx, ulong count, string dist)
{
enum perDigit = 1000;
// NOTE: this is a gross distortion given uint.max but it is "the spec"
enum maxDigits = 9;
uint[perDigit * maxDigits] samples = void;
const chunks = count / samples.length;
const leftOvers = count % samples.length;

if (dist == "prng")
{
foreach (ref val; samples[])
{
val = rndGen.front();
rndGen.popFront();
}
}
else if (dist == "edig" || dist == "edigsorted")
{
// for reference, this is the "6 LOC IIRC"
uint value = 1;
foreach (dig; 0 .. maxDigits)
{
samples[dig * perDigit .. (dig + 1) * perDigit] = value;
value *= 10;
}

if (auto shuffle = dist != "edigsorted")
randomShuffle(samples[]);
}
else
{
assert(0, "dist option should be in {oprng, prng, edig, edigsorted}");
}
const func = funcs[funcIdx];

if (count < 1) // late check gives us a handle on setup time
return 0;
// OK, enough with the setup, time to roll
ulong sum = 0;
foreach (chunk; 0 .. chunks)
{
foreach (sample; samples[])
sum += func(sample);
}
foreach (sample; samples[$ - leftOvers .. $])
sum += func(sample);
return sum;
}

// This is a timing test jig for functions that accept a uint
// and return the number of decimal digits needed to represent
// that uint (except that 10 digit values are lumped in with
// 9 digit values currently).
//
// The command line options here are:
//   -c=count (the number of values to present to the function selected)
//   -s=seed (the value supplied to rndGen.seed)
//   -f=funcIndex (identifies the function under test, see source funcs[])
//   -d=distribution (one of: oprng, prng, edig, edigsorted)
//
// Distributions explained:
//   oprng: original prng distribution, a rndGen.popFront per function call
//   prng:  prng distribution, also rndGen but coming from a large, pre-filled, array
//   edig:  same number of 1-digit, 2-digit, ... numbers, shuffled
//   edigsorted: edig values but sorted to highlight the branch predictor impact
//
// This test jig could evolve in at least six ways:
//  1) the full digit range could (arguably should) be accounted for
//  2) additional distributions, like "favor small numbers", could be added
//  3) additional implementation ideas can be explored
//  4) the input range could be made variable
//  5) the number base could be templatized
//  6) inlined performance could be tested by aliasing Func (runtime switch to template)
//
//  Final note: the main benefit of working on this problem is better understanding, not
//  shaving nanoseconds on this particular function.

void main(string[] args)
{
    ulong count;
    uint seed;
    ubyte func;
    // NOTE: oprng writeln output will usually not equal that of -d=prng
    enum defaultDistribution = "oprng";
    string dist = defaultDistribution;

    GC.disable;
    getopt(args, config.passThrough, "c|count", &count, "f|func", &func,
           "s|seed", &seed, "d|distribution", &dist);
    rndGen.seed(seed);
    const original = dist == "oprng";
    const sum = original ? runOriginalFuncCount(func, count)
                         : runFuncCountDist(func, count, dist);
    writeln(sum);
}








Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 17:11:48 UTC, Basile B. wrote:
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:

On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:
I will post my code if there is any meaningful difference in 
your subsequent results.


give me something I can compile and verify. I'm not there to 
steal, if you found something you can still propose it to the 
repos that would take advantage of the optim.


I'm not at all concerned about theft of trivial code.  I am 
concerned that a simple error in my code will live on in a 
copy/paste environment.


Regardless, I'll post the code once I get home.  It may be the 
only way to communicate the central problem as I see it: 
imprecision in the test specification (the input specification).




Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 27 February 2020 at 15:29:02 UTC, Bruce Carneal 
wrote:

big snip
TL;DR for the snipped: Unsurprisingly, different inputs will lead 
to different timing results.  The equi-probable values supplied 
by a standard PRNG differ significantly from an equi-probable 
digit input.  In particular, the PRNG input is much more 
predictable.


Note also that a uint PRNG will produce numbers that need 10 
decimal digits.  Only values below 10^8 need fewer than 9 digits, 
and those are about 2.3% of the uint range, so you'll see "9 or 
better" out of a uint PRNG roughly 97.7% of the time.  The 
predictability of the full set of inputs is, secondarily but 
importantly, also influenced by the ability of the branch 
predictor to ferret out combinations.


I lack confidence in my ability to reason from first principles 
when presented with a modern CPU branch predictor and a complex 
input.  Much easier for me to just run tests against various 
distributions that I think span the input and reason from the 
results.









Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-27 Thread Bruce Carneal via Digitalmars-d-learn

On Thursday, 27 February 2020 at 08:52:09 UTC, Basile B. wrote:

On Thursday, 27 February 2020 at 04:44:56 UTC, Basile B. wrote:
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:


Maybe you talked about another implementation of 
decimalLength9 ?


Yes.  It's one I wrote after I saw your post. Psuedo-code 
here:


  auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 
100) ... }


Using ldc to target an x86 with the above yields a series of 
cmpl, seta instruction pairs in the function body followed by 
a summing and a return.  No branching.


Let me know if the above is unclear or insufficient.


No thanks, it's crystal clear now.


although I don't see the performance gain.
snip



ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

snip



Could you share your benchmarking method maybe ?


OK. I'm working with equi-probable digits.  You're working with 
equi-probable values.  The use-case may be either of these or, 
more likely, a third as yet unknown/unspecified distribution.


The predictability of the test input sequence clearly influences 
the performance of the branching implementations.  The uniform 
random input is highly predictable wrt digits. (there are *many* 
more high digit numbers than low digit numbers)


Take the implementation above as an example.  If, as in the 
random number (equi-probable value) scenario, your inputs skew 
*stongly* toward a higher number of digits, then the code should 
perform very well.


If you believe the function will see uniform random values as 
input, then testing with uniform random inputs is the way to go.  
If you believe some digits distribution is what will be seen, 
then you should alter your inputs to match.  (testing uniform 
random in addition to the supposed target distribution is almost 
always a good idea for sensitivity analysis).


My testing methodology: To obtain an equi-digit distribution, I 
fill an array with N 1-digit values, followed by N 2-digit 
values, ... followed by N 9-digit values.  I use N == 1000 and 
loop over the total array until I've presented about 1 billion 
values to the function under test.


I test against that equi-digit array before shuffling (this 
favors branching implementations) and then again after shuffling 
(this favors branchless).


If you wish to work with equi-digit distributions I'd prefer that 
you implement the functionality anew.  This is some small 
duplicated effort but will help guard against my having screwed 
up even that simple task (less than 6 LOC IIRC).


I will post my code if there is any meaningful difference in your 
subsequent results.












Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal 
wrote:

On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
wrote:


After shuffling the input, branchless wins by 2.4X (240%).




snip

Let me know if the above is unclear or insufficient.


The 2.4X improvement came when using a shuffled uniform digits 
distribution.  Equal numbers of 1 digit values, 2 digit values, 
... put in to an array and shuffled.


Branchless *loses* by 8.5% or so on my Zen1 when the array is not 
shuffled (when branch predictions are nearly perfect).













Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn

On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal 
wrote:


After shuffling the input, branchless wins by 2.4X (240%).


I've replaced the input by the front of a rndGen (that pops for 
count times and starting with a custom seed) and I never see 
the decimalLength9_3 (which seems to be the closest to the 
original in term of performances) doing better.


The uniform random uint distribution yields a highly skewed 
digits distribution.  Note, for example, that only 10 of the 
2**32 possible outputs from a uint PRNG can be represented by a 
single decimal digit.


Your current winner will be very hard to beat if the inputs are 
uniform random.  That implementation's high-to-low cascade of 
early exit ifs aligns with PRNG output.


If you aren't sure what the input distribution is, or want 
guaranteed good worst case behavior, then branchless may be a 
better way to go.




Maybe you talked about another implementation of decimalLength9 
?


Yes.  It's one I wrote after I saw your post. Psuedo-code here:

  auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) 
... }


Using ldc to target an x86 with the above yields a series of 
cmpl, seta instruction pairs in the function body followed by a 
summing and a return.  No branching.


Let me know if the above is unclear or insufficient.
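
For completeness, here is a filled-in version of that pseudo-code, with 
thresholds for the 9-digit cap discussed in this thread:

ubyte d9_branchless(const uint v)
{
    return cast(ubyte)(1
        + (v >= 10) + (v >= 100) + (v >= 1000) + (v >= 10000)
        + (v >= 100000) + (v >= 1000000) + (v >= 10000000)
        + (v >= 100000000));
}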



Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn
On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal 
wrote:

On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. 
wrote:

...
foreach (i; 0 .. count)
sum += funcs[func](i);


The input stream is highly predictable and strongly skewed 
towards higher digits.


The winning function implementation lines up with that 
distribution.  It would not fare as well with higher entropy 
input.


Using sorted equi-probable inputs (N 1 digit numbers, N 2 digit 
numbers, ...) decimalLength9_0 beats a simple branchless 
implementation by about 10%.


After shuffling the input, branchless wins by 2.4X (240%).



Re: Strange counter-performance in an alternative `decimalLength9` function

2020-02-26 Thread Bruce Carneal via Digitalmars-d-learn

On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:

On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
...
foreach (i; 0 .. count)
sum += funcs[func](i);


The input stream is highly predictable and strongly skewed 
towards higher digits.


The winning function implementation lines up with that 
distribution.  It would not fare as well with higher entropy 
input.