Re: 200-600x slower Dlang performance with nested foreach loop

2021-01-31 Thread methonash via Digitalmars-d-learn
On Sunday, 31 January 2021 at 00:53:05 UTC, Steven Schveighoffer 
wrote:
I'd suggest trying it in reverse. If you have the sequence 
"cba", "ba", "a", then determining "a" is in "ba" is probably 
cheaper than determining "a" is in "cba".


I have user requirements that this application track string IDs 
that get collapsed under parents. To minimize/simplify array 
concatenations, I figured that going in descending order might 
make those operations less expensive.


Though I think this is also worth trying.

Are you still convinced that it's possible to do it in under 2 
seconds? That would seem a huge discrepancy. If not, what 
specifically are you looking for in terms of performance?


Good question. As a sanity check, at some point this past week, I 
re-wrote everything in a single contained Perl program, and 
running that program on the dataset I've shared above took well 
over 2 hours of wall time.


Considering that the index() function in Perl is pre-compiled, I 
imagine that Dlang running 4-5x faster is a pretty good speed 
boost, as it may only be benefiting from the speed of compiled 
loops over interpreted loops.


What confuses me, at this point, is this: I originally wrote the 
D code using foreach in this style:


foreach( i, ref parentString; strings )
{
foreach( j, ref childString; strings[ i + 1 .. $ ] )
{
// ...
}
}

Essentially, the value of j printed to stdout should always be 
larger than the value of i. Yet, when running in the above style, 
the values of j reported to stdout inevitably become smaller than 
i, suggesting that the loop is somehow traversing backwards. How 
can this be explained?


Re: 200-600x slower Dlang performance with nested foreach loop

2021-01-30 Thread methonash via Digitalmars-d-learn

Greetings all,

Many thanks for sharing your collective perspective and advice 
thus far! It has been very helpful and instructive. I return 
bearing live data and a minimally complete, compilable, and 
executable program to experiment with and potentially optimize. 
The dataset can be pulled from here:


https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file 
contains strings on every even-numbered line that have already 
been reduced to the unique de-duplicated subset, and they have 
already been sorted by descending length and alphabetical 
identity.


From my initial post, the focus is now entirely on step #4: 
finding/removing strings that can be entirely absorbed 
(substringed) by their largest possible parent.


And now for the code:


import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
string[] seqs;

foreach( line; stdin.byLine() )
{
if( line[ 0 ] == '>' ) continue;
else seqs ~= to!string( line );
}

foreach( i; 0 .. seqs.length )
{
if( seqs[ i ].length == 0 ) continue;

foreach( j; i + 1 .. seqs.length )
{
if( seqs[ j ].length == 0 ||
seqs[ i ].length == seqs[ j ].length 
) continue;


if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
{
seqs[ j ] = "";

writefln( "%s contains %s", i, j 
);

}
}
}
}


Compile the source and then run the executable via redirecting 
stdin:


./substr < seqs.fasta

See any straightforward optimization paths here?

For curiosity, I experimented with use of string[] and ubyte[][] 
and several functions (indexOf, canFind, countUntil) to assess 
for the best potential performer. My off-the-cuff results:


string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

Resultantly, the code above uses string[] with indexOf(). Tests 
were performed with an Intel(R) Xeon(R) Platinum 8259CL CPU @ 
2.50GHz.


I have additional questions/concerns/confusion surrounding the 
foreach() syntax I have had to apply above, but performance 
remains my chief immediate concern.


Re: 200-600x slower Dlang performance with nested foreach loop

2021-01-26 Thread methonash via Digitalmars-d-learn

On Tuesday, 26 January 2021 at 18:17:31 UTC, H. S. Teoh wrote:
Do not do this. Every time you call .array it allocates a new 
array and copies all its contents over. If this code runs 
frequently, it will cause a big performance hit, not to mention 
high GC load.


The function you're looking for is .release, not .array.


Many thanks for the tip! I look forward to trying this soon. For 
reference, the .array call is only performed once.


That nested loop is an O(n^2) algorithm. Meaning it will slow 
down *very* quickly as the size of the array n increases.  You 
might want to think about how to improve this algorithm.


Nice observation, and yes, this would typically be an O(n^2) 
approach.


However, due to subsetting the input dataset to unique strings 
and then sorting in descending length, one might notice that the 
inner foreach loop does not iterate over all of n, only on the 
iterator value i+1 through the end of the array.


Thus, I believe this would then become approximately O(n^2/2). 
More precisely, it should be O( ( n^2 + n ) / 2 ).


Further: the original dataset has 64k strings. Squaring that 
yields 4.1 billion string comparisons.


Once uniquely de-duplicated, the dataset is reduced to ~46k 
strings. Considering roughly O(n^2/2) at this level, this yields 
1.06 billion string comparisons.


So, performing steps 1 through 3 improves the brute-force string 
comparison problem four-fold in my test development dataset.


Using AA's may not necessarily improve performance.  It depends 
on what your code does with it.  Because AA's require random 
access to memory, it's not friendly to the CPU cache hierarchy, 
whereas traversing linear arrays is more cache-friendly and in 
some cases will out-perform AA's.


I figured a built-in AA might be an efficient path to performing 
unique string de-duplication. If there's a more performant method 
available, I'll certainly try it.


First of all, you need to use a profiler to identify where the 
hotspots are.  Otherwise you could well be pouring tons of 
effort into "optimizing" code that doesn't actually need 
optimizing, while completely missing the real source of the 
problem.  Whenever you run into performance problems, do not 
assume you know where the problem is, profile, profile, profile!


Message received. Given that D is the first compiled language 
I've semi-seriously dabbled with, I have no real experience with 
profiler usage.


Second, you only posted a small fragment of your code, so it's 
hard to say where the problem really is.  I can only guess 
based on what you described.  If you could post the entire 
program, or at least a complete, compilable and runnable 
excerpt thereof that displays the same (or similar) performance 
problems, then we could better help you pinpoint where the 
problem is.


Yes, I'll be looking to present a complete, compilable, and 
executable demo of code for this issue if/when subsequent efforts 
continue to fail.


For professional reasons (because I no longer work in academia), 
I cannot share the original source code for the issue presented 
here, but I can attempt to reproduce it in a minimally complete 
form for a public dataset.


Re: 200-600x slower Dlang performance with nested foreach loop

2021-01-26 Thread methonash via Digitalmars-d-learn

On Tuesday, 26 January 2021 at 17:56:22 UTC, Paul Backus wrote:


It would be much easier for us to help you with this if you 
could post the full program, or at the very least a reduced 
version that reproduces the same issue. [1] Since your attempts 
so far have failed to fix the problem, it is quite likely that 
some part of the code you do not suspect is actually to blame.


I cannot post the full source code.

Regarding a reduced version reproducing the issue: well, that's 
exactly what the nested foreach loop does. Without it, the 
program reaches that point quickly.


With the nested foreach block, it slows to a crawl. More 
specifically, commenting-out the indexOf() or countUntil() 
sub-blocks preserves fast performance, but I'm not sure if that 
may be related to compiler optimizations realizing that there's 
nothing but "dead/nonexistent code" inside the loops and 
generating a binary that just never goes there.


If this may help: I've composed the second Dlang implementation 
as one extended block of code within main() and am thinking of 
soon refactoring the code into different functions. I remain 
pessimistic of whether this may help.


Is there any possibility this could be GC-related?


200-600x slower Dlang performance with nested foreach loop

2021-01-26 Thread methonash via Digitalmars-d-learn

Greetings Dlang wizards,

I seek knowledge/understanding of a very frustrating phenomenon 
I've experienced over the past several days.


The problem space:

1) Read a list of strings from a file
2) De-duplicate all strings into the subset of unique strings
3) Sort the subset of unique strings by descending length and 
then by ascending lexicographic identity
4) Iterate through the sorted subset of unique strings, 
identifying smaller sequences with perfect identity to their 
largest possible parent string


I have written a Dlang program that performantly progresses 
through step #3 above. I used a built-in AA (associative array) 
to uniquely de-duplicate the initial set of strings and then used 
multiSort(). Performance was good up till this point, especially 
with use of the LDC compiler.


Things went sideways at step #4: because multiSort() returns a 
SortedRange, I used .array to convert the returned SortedRange 
into an array of type string[]. This appeared to work, and 
neither DMD nor LDC threw any warnings/errors for doing this.


With the formally returned array, I then attempted to construct a 
double foreach loop to iterate through the sorted array of unique 
strings and find substring matches.


foreach( i, ref pStr; sortedArr )
{
foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
{
if( indexOf( pStr, cStr ) > -1 )
{
// ...
}
}
}

Before adding the code excerpt above, the Dlang program was 
taking ~1 second on an input file containing approx. 64,000 
strings.


By adding the code above, the program now takes 6 minutes to 
complete. An attempt was made to more efficiently perform 
ASCII-only substring searching by converting the sorted string[] 
into ubyte[][] and then using countUntil() instead of indexOf(), 
but this had an effect that was completely opposite to what I had 
previously experienced: the program then took over 20 minutes to 
complete!


Thus, I am entirely baffled.

My first attempt to solve this problem space used a small Perl 
program to perform steps 1 through 3, which would then pipe 
intermediate output to a small Dlang program handling only step 
#4 using dynamic arrays (no use of AAs) of ubyte[][] with use of 
countUntil().


The Dlang code for the nested foreach block above is essentially 
near-identical between my two Dlang implementations. Yet, the 
second implementation--where I'm trying to solve the entire 
problem space in D--has absolutely failed in terms of performance.


Perl+D, ubyte[][], countUntil() :: under 2 seconds
only D, string[], indexOf() :: ~6 minutes
only D, ubyte[][], countUntil() :: >20 minutes

Please advise. This nightmarish experience is shaking my 
confidence in using D.


Re: Reading from stdin significantly slower than reading file directly?

2020-08-13 Thread methonash via Digitalmars-d-learn

Thank you all very much for your detailed feedback!

I wound up pulling the "TREE_GRM_ESTN.csv" file referred to by 
Jon and used it in subsequent tests. Created D-programs for 
reading directly through a File() structure, versus reading 
byLine() from the stdin alias.


After copying the large CSV file to /dev/shm/ (e.g. a ramdisk), I 
re-ran the two programs repeatedly, and I was able to approach 
the 20-30% overhead margin I would expect to see for using a 
shell pipe and its buffer; my results now similarly match Jon's 
above.


Lesson learned: be wary of networked I/O systems (e.g. Isilon 
storage arrays); all kinds of weirdness can happen there ...


Reading from stdin significantly slower than reading file directly?

2020-08-12 Thread methonash via Digitalmars-d-learn

Hi,

Relative beginner to D-lang here, and I'm very confused by the 
apparent performance disparity I've noticed between programs that 
do the following:


1) cat some-large-file | D-program-reading-stdin-byLine()

2) D-program-directly-reading-file-byLine() using File() struct

The D-lang difference I've noticed from options (1) and (2) is 
somewhere in the range of 80% wall time taken (7.5s vs 4.1s), 
which seems pretty extreme.


For comparison, I attempted the same using Perl with the same 
large file, and I only noticed a 25% difference (10s vs 8s) in 
performance, which I imagine to be partially attributable to the 
overhead incurred by using a pipe and its buffer.


So, is this difference in D-lang performance typical? Is this 
expected behavior?


Was wondering if this may have anything to do with the library 
definition for std.stdio.stdin 
(https://dlang.org/library/std/stdio/stdin.html)? Does global 
file-locking significantly affect read-performance?


For reference: I'm trying to build a single-threaded application; 
my present use-case cannot benefit from parallelism, because its 
ultimate purpose is to serve as a single-threaded downstream 
filter from an upstream application consuming (n-1) system 
threads.