Re: topN using a heap

2016-09-25 Thread Andrei Alexandrescu via Digitalmars-d

On 9/25/16 4:19 AM, Jon Degenhardt wrote:

On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei Alexandrescu wrote:

Got curious so I tried to patch my ldc installation (0.17.1 (DMD
v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because
sorting.d uses the new syntax in various places.

Probably the best baseline is the equivalent C++ code using the mature
implementation of nth_element, see
http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end
of this message). Here's what I got:
http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below.

The topN code produced with dmd is faster in virtually all cases (has
parity for -f 4, which I suspect is a degenerate case with most
elements equal, which exercises only a small part of the algorithm).
For C++ I used -O3 -DNDEBUG, any other flags I should use?

[snip]


I added the topN pull request to a local LDC build (current LDC master).
Also built the C++ nth_element and sort program you listed. (GCC 4.9.3,
compiler line: g++-mp-4.9 -O3 -static-libgcc -static-libstdc++ -std=c++11).

Did 7 runs for each variant on the three fields in the ngram file. Median
times are below.

Median sort times (ms):

 Field 2   Field 3   Field 4
DMD  351   348   326
LDC  273   261   245
C++  281   260   246

Median topN / nth_element times (ms):

        Field 2   Field 3   Field 4
LDC       21        44        57
C++       41        60        71

Looking very good indeed!


Thanks for this work! -- Andrei


Re: topN using a heap

2016-09-25 Thread Jon Degenhardt via Digitalmars-d
On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei 
Alexandrescu wrote:
Got curious so I tried to patch my ldc installation (0.17.1 
(DMD v2.068.2, LLVM 3.8.0)), but sadly that didn't work out 
because sorting.d uses the new syntax in various places.


Probably the best baseline is the equivalent C++ code using the 
mature implementation of nth_element, see 
http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at 
the end of this message). Here's what I got: 
http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text 
below.


The topN code produced with dmd is faster in virtually all 
cases (has parity for -f 4, which I suspect is a degenerate 
case with most elements equal, which exercises only a small 
part of the algorithm). For C++ I used -O3 -DNDEBUG, any other 
flags I should use?


[snip]

I added the topN pull request to a local LDC build (current LDC 
master). Also built the C++ nth_element and sort program you 
listed. (GCC 4.9.3, compiler line: g++-mp-4.9 -O3 -static-libgcc 
-static-libstdc++ -std=c++11).


Did 7 runs for each variant on the three fields in the ngram file.
Median times are below.


Median sort times (ms):

 Field 2   Field 3   Field 4
DMD  351   348   326
LDC  273   261   245
C++  281   260   246

Median topN / nth_element times (ms):

        Field 2   Field 3   Field 4
LDC       21        44        57
C++       41        60        71

Looking very good indeed!


Re: topN using a heap

2016-09-24 Thread Andrei Alexandrescu via Digitalmars-d

On 09/23/2016 04:45 PM, Jon Degenhardt wrote:

That's a very nice result. Both the absolute numbers, and the fact that all
three sets achieve similar performance even though they have different
distribution characteristics.


Got curious so I tried to patch my ldc installation (0.17.1 (DMD 
v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d 
uses the new syntax in various places.


Probably the best baseline is the equivalent C++ code using the mature 
implementation of nth_element, see 
http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end 
of this message). Here's what I got: 
http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below.


The topN code produced with dmd is faster in virtually all cases (has 
parity for -f 4, which I suspect is a degenerate case with most elements 
equal, which exercises only a small part of the algorithm). For C++ I 
used -O3 -DNDEBUG, any other flags I should use?


As an aside, I enjoy poking fun at the stunning inefficiency of 
iostreams in my public talks; they delivered again :o).



Andrei

=== shell
$ g++ -O3 -DNDEBUG -omedian_by_sort_cpp issue16517.cpp
$ g++ -O3 -DNDEBUG -DTOPN -omedian_by_topn_cpp issue16517.cpp
$ dmd -O -inline -release -ofmedian_by_sort -boundscheck=off issue16517.d
$ dmd -O -inline -release -version=topn -ofmedian_by_topn -boundscheck=off issue16517.d
$ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 1972; lines: 10512769; total time (ms): 3419.11; sort time (ms): 314.086

$ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 1972; lines: 10512769; total time (ms): 1462; sort time (ms): 399

$ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 1972; lines: 10512769; total time (ms): 3255.41; nth_element time (ms): 40.676

$ cut -f 2 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 1972; lines: 10512769; total time (ms): 1211; topN time (ms): 32

$ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 3; lines: 10512769; total time (ms): 2949.11; sort time (ms): 297.814

$ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 3; lines: 10512769; total time (ms): 1351; sort time (ms): 382

$ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 3; lines: 10512769; total time (ms): 2316.75; nth_element time (ms): 65.041

$ cut -f 3 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 3; lines: 10512769; total time (ms): 973; topN time (ms): 59

$ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 2; lines: 10512769; total time (ms): 2614.47; sort time (ms): 351.285

$ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 2; lines: 10512769; total time (ms): 1269; sort time (ms): 361

$ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 2; lines: 10512769; total time (ms): 2443.17; nth_element time (ms): 76.211

$ cut -f 4 /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 2; lines: 10512769; total time (ms): 998; topN time (ms): 77

===

=== issue16517.cpp
#include <algorithm>
#include <cstdio>
#include <ctime>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
    clock_t swTotal, swSort;
    swTotal = clock();
    vector<double> values;
    double num;
    while (cin >> num) {
        values.push_back(num);
    }

    size_t medianIndex = values.size() / 2;
    swSort = clock();
#ifdef TOPN
    const char* method = "nth_element";
    nth_element(values.begin(), values.begin() + medianIndex, values.end());
#else
    const char* method = "sort";
    sort(values.begin(), values.end());
#endif
    clock_t done = clock();

    printf("median (via %s): %g; lines: %zu; total time (ms): %g; %s time (ms): %g\n",
           method, values[medianIndex], values.size(),
           (done - swTotal) * 1000. / CLOCKS_PER_SEC,
           method, (done - swSort) * 1000. / CLOCKS_PER_SEC);
}
===

=== issue16517.d
import std.stdio;
import std.array : appender;
import std.algorithm : sort, topN;
import std.conv;
import std.range;
import std.datetime: StopWatch;

void main()
{
    StopWatch swTotal, swSort;
    swTotal.start;
    Appender!(double[]) values;
    foreach (line; stdin.byLine)
        values.put(line.to!double);

    size_t medianIndex = values.data.length / 2;
    swSort.start;
    version(topn)
    {
        topN(values.data, medianIndex);
        auto method = "topN";
    }
    else
    {
        sort(values.data);
        auto method = "sort";
    }
    swTotal.stop;
    swSort.stop;

    writefln("median (via %s): %g; lines: %s; total time (ms): %s; %s time (ms): %s",
        method, values.data[medianIndex], values.data.length,
        swTotal.peek.msecs, method, swSort.peek.msecs);
}

Re: topN using a heap

2016-09-23 Thread Jon Degenhardt via Digitalmars-d
On Friday, 23 September 2016 at 16:09:18 UTC, Andrei Alexandrescu 
wrote:


BTW, as I commented in 
https://issues.dlang.org/show_bug.cgi?id=16517, using the new 
topN implementation instead of sort to compute the median over 
google's 1-grams is over 11x faster using dmd.


That's a very nice result. Both the absolute numbers, and the fact that all 
three sets achieve similar performance even though they have different 
distribution characteristics.


--Jon


Re: topN using a heap

2016-09-23 Thread Andrei Alexandrescu via Digitalmars-d

On 09/23/2016 11:40 AM, Stefan Koch wrote:

On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:

Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528,
which is quite a head-scratcher. Any ideas for a workaround? Thanks!
-- Andrei


annotate the templates.


Nagonna work for generic functions -- Andrei


Re: topN using a heap

2016-09-23 Thread Andrei Alexandrescu via Digitalmars-d

On 09/23/2016 11:40 AM, Stefan Koch wrote:

On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:

Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528,
which is quite a head-scratcher. Any ideas for a workaround? Thanks!
-- Andrei


annotate the templates.


I don't want to lose the deduction. I used another workaround (enumerate 
the potential unsafe operations under an if (false) and then forward to 
a @trusted function) in https://github.com/dlang/phobos/pull/4815.


Reviews are welcome!

BTW, as I commented in https://issues.dlang.org/show_bug.cgi?id=16517, 
using the new topN implementation instead of sort to compute the median 
over google's 1-grams is over 11x faster using dmd.



Andrei



Re: topN using a heap

2016-09-23 Thread Stefan Koch via Digitalmars-d
On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu 
wrote:
Work is blocked by 
https://issues.dlang.org/show_bug.cgi?id=16528, which is quite 
a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei


annotate the templates.


Re: topN using a heap

2016-09-23 Thread Andrei Alexandrescu via Digitalmars-d
Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which 
is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei


Re: topN using a heap

2016-09-21 Thread Andrei Alexandrescu via Digitalmars-d

On 09/21/2016 01:37 PM, Jon Degenhardt wrote:

On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote:

On 9/21/16 4:16 AM, Jon Degenhardt wrote:

Timing comparison of sort and topN, times in milliseconds:

             sort      topN
Field 2:      289      1756
Field 3:      285    148793
Field 4:      273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior
is seen for DMD 2.071.2. This makes topN pretty much unusable.


I have it on my list to move https://arxiv.org/abs/1606.00484 into
Phobos. Thanks for the data! -- Andrei


Very good, thanks. It'll be interesting to see how this algorithm does
on this data set.


Preliminary results indicate that QuickselectAdaptive is about 10x 
faster than sort, net of data reading overheads, on the second-field 
data set. I'll proceed with submitting the algorithm in Phobos. -- Andrei


Re: topN using a heap

2016-09-21 Thread Jon Degenhardt via Digitalmars-d
On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei 
Alexandrescu wrote:

On 9/21/16 4:16 AM, Jon Degenhardt wrote:

Timing comparison of sort and topN, times in milliseconds:

             sort      topN
Field 2:      289      1756
Field 3:      285    148793
Field 4:      273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior 
is seen for DMD 2.071.2. This makes topN pretty much unusable.


I have it on my list to move https://arxiv.org/abs/1606.00484 
into Phobos. Thanks for the data! -- Andrei


Very good, thanks. It'll be interesting to see how this algorithm 
does on this data set.


--Jon


Re: topN using a heap

2016-09-21 Thread Andrei Alexandrescu via Digitalmars-d

On 9/21/16 4:16 AM, Jon Degenhardt wrote:

Timing comparison of sort and topN, times in milliseconds:

             sort      topN
Field 2:      289      1756
Field 3:      285    148793
Field 4:      273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior
is seen for DMD 2.071.2. This makes topN pretty much unusable.


I have it on my list to move https://arxiv.org/abs/1606.00484 into 
Phobos. Thanks for the data! -- Andrei




Re: topN using a heap

2016-09-21 Thread Jon Degenhardt via Digitalmars-d
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:


So let me summarize what has happened:

1. topN was reportedly slow. It was using a random pivot. I 
made it use getPivot (deterministic) instead of a random pivot 
in https://github.com/D-Programming-Language/phobos/pull/3921. 
getPivot is also what sort uses.


[snip]

It's not completely clear from this thread what the conclusion was 
regarding getting the known topN performance issues addressed. From 
the pull requests it appears the identified fixes are in current 
release versions of DMD/LDC. However, I hit significant issues on one 
of the first large data sets I tried. Not artificial data, but data 
with very skewed distributions of values (a google ngram file).


Details here:  https://issues.dlang.org/show_bug.cgi?id=16517. 
Includes test program, url for the ngram file.


A brief summary: the data file is a TSV file with 3 numeric fields, a 
bit over 10 million values each, with different distribution 
properties. Used both topN and sort to get the median value. Since 
this was the median, topN was asked for the mid-point value, not a 
value at one end or the other. (This is a specific callout for some 
of the issues identified.)


Timing comparison of sort and topN, times in milliseconds:

             sort      topN
Field 2:      289      1756
Field 3:      285    148793
Field 4:      273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar 
behavior is seen for DMD 2.071.2. This makes topN pretty much 
unusable.


Re: topN using a heap

2016-01-20 Thread Andrei Alexandrescu via Digitalmars-d

On 01/19/2016 08:08 PM, Ivan Kazmenko wrote:

On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote:

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
Do you think sort and topN would be attackable if they used a
per-process-seeded RNG as per Xinok's idea? -- Andrei


Yes, but with a little interactivity (not generating the input in
advance) and more labor (finding the state of RNG).

[snip]

Thanks! -- Andrei



Re: topN using a heap

2016-01-19 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:

So anyhow we need the "intro" part. I'll get to that soon.


Destroy! I made it part of
https://github.com/D-Programming-Language/phobos/pull/3934.

https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04
...


The switching heuristic is bad. It always switches after at most 8 steps.

I'd just use the second heuristic discussed in 
https://en.wikipedia.org/wiki/Introselect .


Re: topN using a heap

2016-01-19 Thread Andrei Alexandrescu via Digitalmars-d

On 01/19/2016 01:27 PM, Timon Gehr wrote:

On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:

So anyhow we need the "intro" part. I'll get to that soon.


Destroy! I made it part of
https://github.com/D-Programming-Language/phobos/pull/3934.

https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04

...


The switching heuristic is bad. It always switches after at most 8 steps.


My algebra is completely rotten. Thanks! Updated 
https://github.com/andralex/phobos/commit/cdd59b51a397ee1a4584e55c07d921c59acb5978.



Andrei


Re: topN using a heap

2016-01-19 Thread Ivan Kazmenko via Digitalmars-d
On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu 
wrote:

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
Do you think sort and topN would be attackable if they used a 
per-process-seeded RNG as per Xinok's idea? -- Andrei


Yes, but with a little interactivity (not generating the input in 
advance) and more labor (finding the state of RNG).


Suppose the adversary has the means to generate an input for 
sort/topN, inspect the output, and then generate a bad input.  
With the current (not-secure) RNGs, from the first two actions, 
they can learn the current state of RNG: with sort, the order of 
elements which compare as equal depends on the choice of pivot, 
and therefore exposes the random bits involved; with topN, the 
order of everything except what topN guarantees does the same 
thing.


After that, generating a bad input is definitely possible, since 
everything is known and deterministic at this point.  The 
previously discussed method can be used, or something more clever 
if they want to make it fast.


Ivan Kazmenko.



Re: topN using a heap

2016-01-19 Thread Andrei Alexandrescu via Digitalmars-d

On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:

4. sort was and is attackable before all of these changes

(b) Improve sort() first, then apply a similar strategy to improving
topN. Do not use RNGs at all.


Since point 4 is in fact already fixed a good while ago, my suggestion
would be (b): to do the same for topN.

Introselect (https://en.wikipedia.org/wiki/Introselect), already
mentioned in this thread, uses median-of-medians to achieve worst case
O(n) performance if we recurse too deep.  There is already an issue
suggesting to implement it:
https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement
deterministic topN).

In fact, the O(n log n) heap approach as it is implemented now could be
faster than O(n) median-of-medians approach on reasonable inputs.  So,
someone will have to implement median-of-medians and, well, measure.

At the very least, googling for "median of medians in practice" and such
yields the tag wiki from StackOverflow.com: "The constant factor in the
O(n) is large, and the algorithm is not commonly used in practice.".


Yah, I also think heap/double-heap topN is better; median-of-medians 
(MoM) is theoretically nice but practically less so. Though there are 
two things that could help MoM in D: (a) median of five (core part of 
MoM) is relatively cheap to compute with six comparisons; (b) we can use 
stride() to compute median without needing to copy a fifth of the input 
or to complicate the algorithm - just recurse on the stride! All in all 
this might be a good algo to implement and test.


So anyhow we need the "intro" part. I'll get to that soon.


Andrei



Re: topN using a heap

2016-01-19 Thread Andrei Alexandrescu via Digitalmars-d

On 01/18/2016 09:42 PM, Xinok wrote:

To be clear, sort is NOT attackable. Introsort is used for unstable
sorting which begins with quick sort but falls back to heap sort after
too many recursions to maintain O(n log n) running time. Timsort is used
for stable sorting which is a variant of merge sort but still runs in
O(n log n) time. In either case, you're guaranteed to have O(n log n)
running time in the worst case.


Sorry, my bad. Could you please point to the code doing the 
introspection? I might do the same in topN.



On the other hand, someone can improve upon quick sort by choosing
better pivots but be careful not to add too much overhead. A couple
years ago, I tried choosing the pivot from a median of five but the
result was as much as 10% slower. One idea is to begin initially
choosing better pivots, but once the sublists fall below a certain
length, simply choose the pivot from a median of three to avoid the
extra overhead.


That's what https://github.com/D-Programming-Language/phobos/pull/3922 
does, and in my measurements it's about as fast as or faster than the 
current code. Could you please re-run your measurements against that PR?



Speaking of RNGs, they're technically pure as long as you always use the
same seed. So what if we generated a random seed at the start of the
process, but then reused that same seed over and over in pure functions
for the duration of the process?


That's a great idea. The way D is defined, it already implies that 
"immutable" is process-immutable, not immutable across runs. Consider 
http://dpaste.dzfl.pl/2fa5baf0c149.


Very nice insights, Xinok. Thanks!


Andrei




Re: topN using a heap

2016-01-19 Thread Andrei Alexandrescu via Digitalmars-d

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:

4. sort was and is attackable before all of these changes


No, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse
too deep): see
https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053.
This improvement happened a year or two ago, when algorithm.d was not
even separated into sub-modules.

My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits
that by not being able to slow the sort down.  But, if I comment out
line 1053 to disable the switching, things go quadratic as expected.

L1053: depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3;

The same remedy can help for topN: if we called partition more than log
(length) in base (3/2) times, switch to the heap implementation of topN
regardless of whether n is close to the edge.


Do you think sort and topN would be attackable if they used a 
per-process-seeded RNG as per Xinok's idea? -- Andrei




Re: topN using a heap

2016-01-19 Thread Andrei Alexandrescu via Digitalmars-d

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:

So anyhow we need the "intro" part. I'll get to that soon.


Destroy! I made it part of 
https://github.com/D-Programming-Language/phobos/pull/3934.


https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04

One interesting side topic would be how to avoid the use of goto without 
losing clarity.



Andrei



Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
wrote:

All - let me know how things can be further improved. Thx!

Here goes the test which shows quadratic behavior for the new 
version:

http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)

The inspiration is the paper "A Killer Adversary for Quicksort":
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
(I already mentioned it on the forums a while ago)

Ivan Kazmenko.


Perhaps I should include a textual summary as well.

The code on DPaste starts by constructing an array of Elements of 
size MAX_N; in the code, MAX_N is 50_000.  After that, we run the 
victim function on our array.  Here, the victim is topN (array, 
MAX_N / 2); it could be sort (array) or something else.


An Element contains, or rather, pretends to contain, an int 
value.  Here is how Element is special: the result of comparison 
for two Elements is decided on-the-fly.  An Element can be either 
UNDECIDED or have a fixed value.  Initially, all elements are 
UNDECIDED.  When we compare two Elements and at least one of them 
has a fixed value, the comparison is resolved naturally, and 
UNDECIDED element is greater than any fixed element.  When we 
compare two UNDECIDED elements, the one which participated more 
in the comparisons so far gets a fixed value: greater than any 
other value fixed so far, but still less than UNDECIDED.  This 
way, the results of old comparisons are consistent with the new 
fixed value.


Now, what do we achieve by running the victim function?  Turns 
out that the algorithms using the idea of QuickSort or 
QuickSelect tend to make most comparisons against their current 
pivot value.  Our Element responds to that by fixing the pivot to 
one of the lowest available values.  After that, a partition 
using such pivot will have only few, O(1), elements before the 
pivot, and the rest after the pivot.  In total, this will lead to 
quadratic performance.


After running the victim function on our array of Elements (which 
- careful here - already takes quadratic time), we reorder them 
in their original order (to do that, each Element also stores its 
original index).


Now, we can re-run the algorithm on the array obtained so far.  
If the victim function is (strongly) pure, it will inevitably 
make the same comparisons in the same order.  The only difference 
is that their result will already be decided.


Alternatively, we can make an int array of the current values in 
our array of Elements (also in their original order).  Running 
the victim function on the int array must also make the same 
(quadratic number of) comparisons in the same order.


Ivan Kazmenko.



Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d
On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
wrote:

On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:

Here is a more verbose version.


OK, very nice. Thanks! I've modified topN to work as follows. 
In a loop:


* If nth <= r.length / log2(r.length)^^2 (or is similarly close 
to r.length), use topNHeap with one heap and stop
* If nth <= r.length / log2(r.length) (or is similarly close to 
r.length), use topNHeap with two heaps and stop

* Take a quickselect step, adjust r and nth, and continue

That seems to work nicely for all values involved.

All - let me know how things can be further improved. Thx!


Andrei


Here goes the test which shows quadratic behavior for the new 
version:

http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)

My local timings with "-O -release -noboundscheck -inline":
building Element array: 4989 msecs
checking Element array: 5018 msecs
checking int array: 626 msecs

With "-debug -unittest":
building Element array: 8384 msecs
checking Element array: 8380 msecs
checking int array: 2987 msecs

If we change the length MAX_N to something observable, say, 50, 
the array is:
[0, 536870911, 2, 536870911, 4, 536870911, 6, 36, 8, 33, 10, 35, 
12, 536870911, 14, 32, 16, 34, 18, 536870911, 536870911, 22, 23, 
21, 20, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 31, 30, 536870911, 29, 
536870911, 28, 536870911, 27, 536870911, 26, 536870911, 25, 
536870911, 24, 536870911]


The old version (2.070.0-b2) could not be tricked with it, does 
it use random?


The inspiration is the paper "A Killer Adversary for Quicksort":
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
(I already mentioned it on the forums a while ago)

Ivan Kazmenko.



Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 6:55 PM, deadalnix wrote:

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:

unpredictableSeed uses the system clock as a source of randomness, so
we're good there. -- Andrei


I've had problems with that even when cryptographically secure randomness
wasn't needed. A specific case involved adding jitter to timeouts to avoid
all hosts expiring at the same time, using a Mersenne Twister with a
pseudo-random seed. There were still way too many hosts ending up with the
same timeout.

System time is not enough.


How would this translate to a matter of selecting the pivot during sort? 
-- Andrei


Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 6:18 PM, Ilya wrote:

On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:

Here goes the test which shows quadratic behavior for the new version:
http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)


Correction: this is the result of removing a uniform call in pull
3921.  Since we are supposed to discuss the improvements related to
pull 3934 here, I created a separate entry for the issue:
https://issues.dlang.org/show_bug.cgi?id=15583.


RNGs don't improve the worst case. They only change the permutation for
the worst case. --Ilya


Well it does improve things. The probability of hitting the worst case 
repeatedly is practically zero, and it's impossible to create an attack 
input.


I'm not sure whether we should worry about this. Probably we do because 
sort attacks are a press favorite.


The nice thing about not relying on randomness is that pure functions 
can call sort, topN etc.


As a sort of a compromise I was thinking of seeding the RNG with not 
only the length of the range, but also the integral representation of 
the address of the first element. This would still allow an attack if 
the input is always at the same address.



Thoughts?

Andrei


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
RNGs don't improve the worst case. They only change the 
permutation for the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst 
case beforehand, once and for all.  In that sense, this is a 
regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. 
--Ilya


Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 6:27 PM, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:

RNGs don't improve the worst case. They only change the permutation for
the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst case
beforehand, once and for all.  In that sense, this is a regression.


BTW I forgot to thank you for this analysis. This is good stuff. -- Andrei



Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d
On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 6:18 PM, Ilya wrote:
RNGs don't improve the worst case. They only change the 
permutation for the worst case. --Ilya


Well it does improve things. The probability of hitting the 
worst case repeatedly is practically zero, and it's impossible 
to create an attack input.


Possible, but only if the seed and previous usage of victim 
program's global RNG are also known to the attacker... at which 
point the victim may well have a bigger problem than just 
quadratic topN.


I'm not sure whether we should worry about this. Probably we do 
because sort attacks are a press favorite.


The nice thing about not relying on randomness is that pure 
functions can call sort, topN etc.


Is it feasible to have two overloads of sort and/or topN, one 
pure and the other with better complexity guarantees?


As a sort of a compromise I was thinking of seeding the RNG 
with not only the length of the range, but also the integral 
representation of the address of the first element. This would 
still allow an attack if the input is always at the same 
address.


Hmm.  Isn't using any pointers still breaking strong purity?  Say 
the GC moved our array, and it now has a different address.  A 
strongly pure function would order the elements exactly the same 
then.


The same goes about sorting, where the thing depending on pivot 
choice is the relative order of "equal" elements.


From a theoretical standpoint (not taking current D purity rules 
into account), I'd say using a pointer (which may be modified by 
GC) is as pure as just allowing a static RNG (a global one, or 
even another instance dedicated specifically to sort/topN).


Ivan Kazmenko.



Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 12:55 AM, Ilya wrote:

On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:

On 01/19/2016 12:50 AM, Ilya wrote:

...

1. Yes, the probability of hitting the worst case repeatedly is
practically zero. But RNGs do not change this probability.
2. It is possible to build an attack for our RNGs, because they are
pseudo-RNGs.
--Ilya


You also need to predict the seed. How do you do that?


We cannot use an unpredictable seed (like the system clock) in pure functions.
--Ilya


Clearly. The point is, there already was an impure implementation of 
topN whose expected linear running time is still specified in the 
documentation. The current implementation does not fulfill this bound.


Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 6:51 PM, Ilya wrote:

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:

On 1/18/16 6:44 PM, Ilya wrote:

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:

RNGs don't improve the worst case. They only change the permutation for the
worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst case
beforehand, once and for all.  In that sense, this is a regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya


unpredictableSeed uses the system clock as a source of randomness, so
we're good there. -- Andrei


Would this work for pure functions? --Ilya


Of course not. I think this back-and-forth takes away from the gist of 
things. So let me summarize what has happened:


1. topN was reportedly slow. It was using a random pivot. I made it use 
getPivot (deterministic) instead of a random pivot in 
https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is 
also what sort uses.


2. Now that both functions use getPivot, I set out to improve it in 
https://github.com/D-Programming-Language/phobos/pull/3922. The problem 
is I couldn't make getPivot impure; pure functions already call sort, so 
making it impure would have been a regression.


3. So I made getPivot use just a bit of randomness taken from the length 
of the range.


4. sort was and is attackable before all of these changes.

5. So now we have pure topN and sort (subject to the range offering pure 
primitives) but they are both attackable.


6. PRNGs don't have any other downside than making functions impure.

The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no regression. 
Then proceed and make sort() avoid quadratic cases in other ways, e.g. 
switch to heapsort if performance degrades.


(b) Improve sort() first, then apply a similar strategy to improving 
topN. Do not use RNGs at all.


(c) Seek randomness in other places, e.g. addresses of elements, data 
hashes, etc. Come up with a solution that may still be attacked in narrow 
cases but make that unlikely enough not to be a real concern.



Andrei



Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Tuesday, 19 January 2016 at 00:02:08 UTC, Timon Gehr wrote:

On 01/19/2016 12:55 AM, Ilya wrote:

On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:

On 01/19/2016 12:50 AM, Ilya wrote:

...

1. Yes, the probability of hitting the worst case repeatedly is 
practically zero. But RNGs do not change this probability.
2. It is possible to build an attack against our RNGs, because they 
are pseudo-RNGs.
--Ilya


You also need to predict the seed. How do you do that?


We cannot use an unpredictable seed (like the system clock) in pure 
functions.

--Ilya


Clearly. The point is, there already was an impure 
implementation of topN whose expected linear running time is 
still specified in the documentation. The current 
implementation does not fulfill this bound.


There is already an implementation with a predictable seed. Proof: 
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 7:46 PM, Ilya wrote:

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

--Ilya


The default RNG is seeded with unpredictableSeed. What is the 
point you are trying to make?


unpredictableSeed is initialized only once and can be easily 
estimated.

--Ilya


https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness 
used for initializing MinstdRand0 are the PID, the thread ID, 
and the system time at the moment of the seeding. Then at each 
subsequent call to unpredictableSeed, the time of the call is 
XORed with the current value of the MinstdRand0.


How do you think things could be improved?


Andrei


A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
   a. A hacker would not be able to estimate a seed using other API 
calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using small 
arrays, because they don't use RNGs (and they have no overhead!).
   c. A hacker would not be able to estimate a seed for big 
arrays, because an attack based on time measurement would not work 
for big arrays.


Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 7:46 PM, Ilya wrote:

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

--Ilya


The default RNG is seeded with unpredictableSeed. What is the 
point you are trying to make?


unpredictableSeed is initialized only once and can be easily 
estimated.

--Ilya


https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness 
used for initializing MinstdRand0 are the PID, the thread ID, 
and the system time at the moment of the seeding. Then at each 
subsequent call to unpredictableSeed, the time of the call is 
XORed with the current value of the MinstdRand0.


How do you think things could be improved?


Andrei


A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
   a. A hacker would not be able to estimate a seed using other API 
calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using small 
arrays, because they don't use RNGs (and they have no overhead!).
   c. A hacker would not be able to estimate a seed for big 
arrays, because an attack based on time measurement would not work 
for big arrays.


EDIT:
  c. ... because each big array has its own seed, and time measurement 
would not work for big arrays with different seeds.


Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
Here goes the test which shows quadratic behavior for the new 
version:

http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)


Correction: this is the result of removing a uniform call in 
pull 3921.  Since we are supposed to discuss the improvements 
related to pull 3934 here, I created a separate entry for the 
issue: https://issues.dlang.org/show_bug.cgi?id=15583.


RNGs don't improve the worst case. They only change the permutation 
for the worst case. --Ilya


Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
RNGs don't improve the worst case. They only change the permutation 
for the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst case 
beforehand, once and for all.  In that sense, this is a 
regression.




Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 12:50 AM, Ilya wrote:

...

1. Yes, the probability of hitting the worst case repeatedly is
practically zero. But RNGs do not change this probability.
2. It is possible to build an attack against our RNGs, because they are
pseudo-RNGs.
--Ilya


You also need to predict the seed. How do you do that?


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 6:44 PM, Ilya wrote:
On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
RNGs don't improve the worst case. They only change the 
permutation for the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst 
case
beforehand, once and for all.  In that sense, this is a 
regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. 
--Ilya


unpredictableSeed uses the system clock as a source of 
randomness, so we're good there. -- Andrei


Would this work for pure functions? --Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 6:18 PM, Ilya wrote:
On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko 
wrote:
On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko 
wrote:
Here goes the test which shows quadratic behavior for the 
new version:

http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)


Correction: this is the result of removing a uniform call in pull 
3921.  Since we are supposed to discuss the improvements related to 
pull 3934 here, I created a separate entry for the issue:
https://issues.dlang.org/show_bug.cgi?id=15583.


RNGs don't improve the worst case. They only change the 
permutation for the worst case. --Ilya


Well it does improve things. The probability of hitting the 
worst case repeatedly is practically zero, and it's impossible 
to create an attack input.


I'm not sure whether we should worry about this. Probably we do 
because sort attacks are a press favorite.


The nice thing about not relying on randomness is that pure 
functions can call sort, topN etc.


As a sort of a compromise I was thinking of seeding the RNG 
with not only the length of the range, but also the integral 
representation of the address of the first element. This would 
still allow an attack if the input is always at the same 
address.



Thoughts?

Andrei


1. Yes, the probability of hitting the worst case repeatedly is 
practically zero. But RNGs do not change this probability.
2. It is possible to build an attack against our RNGs, because they are 
pseudo-RNGs.

--Ilya


Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 12:39 AM, Andrei Alexandrescu wrote:

On 1/18/16 6:18 PM, Ilya wrote:

On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:

Here goes the test which shows quadratic behavior for the new version:
http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)


Correction: this is the result of removing a uniform call in pull
3921.  Since we are supposed to discuss the improvements related to
pull 3934 here, I created a separate entry for the issue:
https://issues.dlang.org/show_bug.cgi?id=15583.


RNGs don't improve the worst case. They only change the permutation for
the worst case. --Ilya


Well it does improve things. The probability of hitting the worst case
repeatedly is practically zero, and it's impossible to create an attack
input.

I'm not sure whether we should worry about this. Probably we do because
sort attacks are a press favorite.

The nice thing about not relying on randomness is that pure functions
can call sort, topN etc.

As a sort of a compromise I was thinking of seeding the RNG with not
only the length of the range, but also the integral representation of
the address of the first element. This would still allow an attack if
the input is always at the same address.


Thoughts?

Andrei


https://en.wikipedia.org/wiki/Introselect


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 6:51 PM, Ilya wrote:
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
Alexandrescu wrote:

On 1/18/16 6:44 PM, Ilya wrote:
On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
RNGs don't improve the worst case. They only change the 
permutation for the worst case. --Ilya


Still, use of RNG makes it impossible to construct the 
worst case
beforehand, once and for all.  In that sense, this is a 
regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. 
--Ilya


unpredictableSeed uses the system clock as a source of 
randomness, so we're good there. -- Andrei


Would this work for pure functions? --Ilya


Of course not. I think this back-and-forth takes away from the 
gist of things. So let me summarize what has happened:


1. topN was reportedly slow. It was using a random pivot. I 
made it use getPivot (deterministic) instead of a random pivot 
in https://github.com/D-Programming-Language/phobos/pull/3921. 
getPivot is also what sort uses.


2. Now that both functions use getPivot, I set out to improve 
it in 
https://github.com/D-Programming-Language/phobos/pull/3922. The 
problem is I couldn't make getPivot impure; pure functions 
already call sort, so making it impure would have been a 
regression.


3. So I made getPivot use just a bit of randomness taken from 
the length of the range.


4. sort was and is attackable before all of these changes.

5. So now we have pure topN and sort (subject to the range 
offering pure primitives) but they are both attackable.


6. PRNGs don't have any other downside than making functions 
impure.


The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no 
regression. Then proceed and make sort() avoid quadratic cases 
in other ways, e.g. switch to heapsort if performance degrades.


(b) Improve sort() first, then apply a similar strategy to 
improving topN. Do not use RNGs at all.


(c) Seek randomness in other places, e.g. addresses of elements, 
data hashes, etc. Come up with a solution that may still be 
attacked in narrow cases but make that unlikely enough not to be a 
real concern.



Andrei


(a) I hope not.
(b) Yes.
(c) Memory addresses may not work with user-defined ranges, and 
hashes could be slow.


(d) Make PRNGs optional. --Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
--Ilya


The default RNG is seeded with unpredictableSeed. What is the 
point you are trying to make?


unpredictableSeed is initialized only once and can be easily 
estimated. --Ilya


Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 7:46 PM, Ilya wrote:

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

--Ilya


The default RNG is seeded with unpredictableSeed. What is the point
you are trying to make?


unpredictableSeed is initialized only once and can be easily estimated.
--Ilya


https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness used for 
initializing MinstdRand0 are the PID, the thread ID, and the system time 
at the moment of the seeding. Then at each subsequent call to 
unpredictableSeed, the time of the call is XORed with the current value 
of the MinstdRand0.
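The scheme described above can be sketched roughly as follows. This is an illustrative C++ approximation, not the actual Phobos code: `std::minstd_rand0` stands in for MinstdRand0, and the PID is a stand-in constant since reading it portably is beside the point.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>

// Rough sketch of the seeding scheme described above: an engine is seeded
// once from PID, thread ID, and time, and every subsequent call XORs the
// current time into the engine's next output.
static std::minstd_rand0 seeder = [] {
    std::uint32_t pid = 12345;  // stand-in: the real code uses the process ID
    std::uint32_t tid = static_cast<std::uint32_t>(
        std::hash<std::thread::id>{}(std::this_thread::get_id()));
    std::uint32_t now = static_cast<std::uint32_t>(
        std::chrono::system_clock::now().time_since_epoch().count());
    return std::minstd_rand0(pid ^ tid ^ now);
}();

std::uint32_t unpredictable_seed() {
    std::uint32_t now = static_cast<std::uint32_t>(
        std::chrono::high_resolution_clock::now().time_since_epoch().count());
    return now ^ static_cast<std::uint32_t>(seeder());
}
```

Note that even if the clock is coarse, the engine advances on every call, so consecutive seeds still differ.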


How do you think things could be improved?


Andrei



Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d
On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:

On 1/18/16 7:46 PM, Ilya wrote:

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

--Ilya


The default RNG is seeded with unpredictableSeed. What is the 
point you are trying to make?


unpredictableSeed is initialized only once and can be easily 
estimated.

--Ilya


https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness 
used for initializing MinstdRand0 are the PID, the thread ID, 
and the system time at the moment of the seeding. Then at each 
subsequent call to unpredictableSeed, the time of the call is 
XORed with the current value of the MinstdRand0.


How do you think things could be improved?


Andrei


A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
   a. A hacker would not be able to estimate a seed using other 
API calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using small 
arrays, because they don't use RNGs (and they have no overhead!).
   c. A hacker would not be able to estimate a seed for big 
arrays, because an attack based on time measurement would not work 
for big arrays.


EDIT:
  c. ... because each big array has its own seed, and time measurement 
would not work for big arrays with different seeds.


__EDIT2__:
To be fair, (c) is not really correct. To do (c) correctly, a 
thread-local global seed should be used to generate an 
unpredictableSeed for each big array.


Ilya


Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:

4. sort was and is attackable before all of these changes.


No, sort utilizes Introsort (Quicksort, but switching to Heapsort if it 
recurses too deep): see 
https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053.  This improvement happened a year or two ago, when algorithm.d was not even separated into sub-modules.


My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") 
confirms that by failing to slow the sort down.  But if I 
comment out line 1053 to disable the switching, things go 
quadratic as expected.


L1053:depth = depth >= depth.max / 2 ? (depth / 3) * 2 : 
(depth * 2) / 3;


The same remedy can help for topN: if we have called partition more 
than log base 3/2 of (length) times, switch to the heap 
implementation of topN regardless of whether n is close to the 
edge.
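The remedy just described can be sketched like this. It is a hedged C++ illustration rather than D: `std::nth_element` stands in for the heap-based fallback, and the budget is the log base 3/2 count from the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Quickselect with a partition budget: after roughly log_{3/2}(length)
// partition rounds, give up on quickselect and switch to a guaranteed
// fallback (std::nth_element stands in for the heap-based topN here).
void top_n_guarded(std::vector<int>& a, std::size_t nth) {
    std::size_t lo = 0, hi = a.size();
    int budget = static_cast<int>(
        std::log(static_cast<double>(a.size()) + 2) / std::log(1.5)) + 1;
    while (hi - lo > 1) {
        if (budget-- <= 0) {  // too many partition rounds: use the fallback
            std::nth_element(a.begin() + lo, a.begin() + nth, a.begin() + hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        // Three-way partition of [lo, hi) into < pivot, == pivot, > pivot.
        auto m1 = std::partition(a.begin() + lo, a.begin() + hi,
                                 [&](int x) { return x < pivot; });
        auto m2 = std::partition(m1, a.begin() + hi,
                                 [&](int x) { return x == pivot; });
        std::size_t i1 = static_cast<std::size_t>(m1 - a.begin());
        std::size_t i2 = static_cast<std::size_t>(m2 - a.begin());
        if (nth < i1)      hi = i1;  // nth element is among the smaller ones
        else if (nth < i2) return;   // it equals the pivot: already in place
        else               lo = i2;  // it is among the larger ones
    }
}
```

The three-way partition guarantees progress even when many elements equal the pivot, so the budget is only exhausted by genuinely bad pivot sequences.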


Ivan Kazmenko.



Re: topN using a heap

2016-01-18 Thread deadalnix via Digitalmars-d
On Tuesday, 19 January 2016 at 00:17:30 UTC, Andrei Alexandrescu 
wrote:
How would this translate to a matter of selecting the pivot 
during sort? -- Andrei


A large chunk of a given datacenter going quadratic at the same 
time.




Re: topN using a heap

2016-01-18 Thread Andrei Alexandrescu via Digitalmars-d

On 1/18/16 6:44 PM, Ilya wrote:

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:

RNGs don't improve the worst case. They only change the permutation for
the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst case
beforehand, once and for all.  In that sense, this is a regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya


unpredictableSeed uses the system clock as a source of randomness, so 
we're good there. -- Andrei


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Monday, 18 January 2016 at 23:55:38 UTC, Timon Gehr wrote:

On 01/19/2016 12:51 AM, Ilya wrote:
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
Alexandrescu wrote:

On 1/18/16 6:44 PM, Ilya wrote:
On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
wrote:

[...]


No, it is definitely possible, because RNGs are Pseudo-RNGs. 
--Ilya


unpredictableSeed uses the system clock as a source of 
randomness, so we're good there. -- Andrei


Would this work for pure functions? --Ilya


Only if they accept the RNG as an additional argument.


Exactly :) So, I hope we will not add pseudo-RNGs to 
std.algorithm. --Ilya


Re: topN using a heap

2016-01-18 Thread Ilya via Digitalmars-d

On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:

On 01/19/2016 12:50 AM, Ilya wrote:

...

1. Yes, the probability of hitting the worst case repeatedly is
practically zero. But RNGs do not change this probability.
2. It is possible to build an attack against our RNGs, because they 
are pseudo-RNGs.
--Ilya


You also need to predict the seed. How do you do that?


We cannot use an unpredictable seed (like the system clock) in pure 
functions. --Ilya


Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 12:51 AM, Ilya wrote:

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:

On 1/18/16 6:44 PM, Ilya wrote:

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:

RNGs don't improve the worst case. They only change the permutation for
the worst case. --Ilya


Still, use of RNG makes it impossible to construct the worst case
beforehand, once and for all.  In that sense, this is a regression.


No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya


unpredictableSeed uses the system clock as a source of randomness, so
we're good there. -- Andrei


Would this work for pure functions? --Ilya


Only if they accept the RNG as an additional argument.
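That idea can be sketched as follows: a hypothetical pivot chooser that takes its RNG as an explicit argument (C++ rather than D, and not the Phobos API). Given the same generator state it always returns the same index, so it is referentially transparent in the way a pure function needs; the unpredictability lives entirely in the caller's seed.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Pivot selection with the RNG threaded through as a parameter.
// Deterministic for a given generator state; the caller decides whether
// to seed it predictably (pure, reproducible) or from real entropy.
template <class Rng>
std::size_t pick_pivot(std::size_t lo, std::size_t hi, Rng& rng) {
    std::uniform_int_distribution<std::size_t> dist(lo, hi - 1);
    return dist(rng);
}
```

Two generators seeded identically will choose identical pivot sequences, which is exactly the property D's purity rules require.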


Re: topN using a heap

2016-01-18 Thread deadalnix via Digitalmars-d
On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:
unpredictableSeed uses the system clock as a source of 
randomness, so we're good there. -- Andrei


I've had problems with that more than once, even when cryptographically 
secure randomness wasn't needed. A specific case involved 
adding jitter to a timeout to avoid all hosts expiring at the 
same time, using a Mersenne Twister with a pseudo-random seed. 
There were still way too many hosts ending up with the same 
timeout.


System time is not enough.



Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/19/2016 01:12 AM, Ilya wrote:




There is already an implementation with a predictable seed. Proof:
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
--Ilya


The default RNG is seeded with unpredictableSeed. What is the point you 
are trying to make?


Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:

4. sort was and is attackable before all of these changes.

(b) Improve sort() first, then apply a similar strategy to 
improving topN. Do not use RNGs at all.


Since point 4 is in fact already fixed a good while ago, my 
suggestion would be (b): to do the same for topN.


Introselect (https://en.wikipedia.org/wiki/Introselect), already 
mentioned in this thread, uses median-of-medians to achieve worst 
case O(n) performance if we recurse too deep.  There is already 
an issue suggesting to implement it: 
https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: 
implement deterministic topN).


In fact, the O(n log n) heap approach as it is implemented now 
could be faster than the O(n) median-of-medians approach on 
reasonable inputs.  So, someone will have to implement 
median-of-medians and, well, measure.


At the very least, googling for "median of medians in practice" 
and such yields the tag wiki from StackOverflow.com: "The 
constant factor in the O(n) is large, and the algorithm is not 
commonly used in practice.".


Ivan Kazmenko.



Re: topN using a heap

2016-01-18 Thread Xinok via Digitalmars-d
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
Of course not. I think this back-and-forth takes away from the 
gist of things. So let me summarize what has happened:


1. topN was reportedly slow. It was using a random pivot. I 
made it use getPivot (deterministic) instead of a random pivot 
in https://github.com/D-Programming-Language/phobos/pull/3921. 
getPivot is also what sort uses.


2. Now that both functions use getPivot, I set out to improve 
it in 
https://github.com/D-Programming-Language/phobos/pull/3922. The 
problem is I couldn't make getPivot impure; pure functions 
already call sort, so making it impure would have been a 
regression.


3. So I made getPivot use just a bit of randomness taken from 
the length of the range.


4. sort was and is attackable before all of these changes.

5. So now we have pure topN and sort (subject to the range 
offering pure primitives) but they are both attackable.


6. PRNGs don't have any other downside than making functions 
impure.


The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no 
regression. Then proceed and make sort() avoid quadratic cases 
in other ways, e.g. switch to heapsort if performance degrades.


(b) Improve sort() first, then apply a similar strategy to 
improving topN. Do not use RNGs at all.


(c) Seek randomness in other places, e.g. addresses of elements, 
data hashes, etc. Come up with a solution that may still be 
attacked in narrow cases but make that unlikely enough not to be a 
real concern.



Andrei


To be clear, sort is NOT attackable. Introsort is used for 
unstable sorting, which begins with quick sort but falls back to 
heap sort after too many recursions to maintain O(n log n) 
running time. Timsort is used for stable sorting, which is a 
variant of merge sort but still runs in O(n log n) time. In 
either case, you're guaranteed O(n log n) running time in 
the worst case.


On the other hand, someone can improve upon quick sort by 
choosing better pivots, while being careful not to add too much 
overhead. A couple of years ago, I tried choosing the pivot from a 
median of five, but the result was as much as 10% slower. One idea 
is to begin by choosing better pivots, but once the 
sublists fall below a certain length, simply choose the pivot 
from a median of three to avoid the extra overhead.
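The median-of-three step mentioned here can be sketched as follows (an illustrative C++ helper, not Phobos's actual getPivot):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Median-of-three pivot selection: return the index of the median of the
// first, middle, and last elements of [lo, hi). Cheap, and it defuses the
// classic quadratic inputs (already-sorted and reverse-sorted arrays).
std::size_t median_of_three(const std::vector<int>& a,
                            std::size_t lo, std::size_t hi) {
    std::size_t mid = lo + (hi - lo) / 2, last = hi - 1;
    int x = a[lo], y = a[mid], z = a[last];
    if ((x <= y && y <= z) || (z <= y && y <= x)) return mid;
    if ((y <= x && x <= z) || (z <= x && x <= y)) return lo;
    return last;
}
```

A quicksort or quickselect would swap the chosen element into pivot position before partitioning; growing the sample to five or more elements follows the same pattern at slightly higher cost.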


As for topN, there are two approaches I'm aware of that are 
deterministic without resorting to impurities or RNGs. The first 
is introselect which is similar to introsort in that it has a 
fall back algorithm. The other approach is to choose the pivot 
from a median of medians. My idea is a variant of the latter 
approach: Begin by choosing the pivot from a median of five. If 
you continuously choose bad pivots, take a larger sample of 
elements to choose the pivot from. Keep growing the sample as 
necessary.


Speaking of RNGs, they're technically pure as long as you always 
use the same seed. So what if we generated a random seed at the 
start of the process, but then reused that same seed over and 
over in pure functions for the duration of the process?


Re: topN using a heap

2016-01-18 Thread Ivan Kazmenko via Digitalmars-d

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
Here goes the test which shows quadratic behavior for the new 
version:

http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)


Correction: this is the result of removing a uniform call in pull 
3921.  Since we are supposed to discuss the improvements related 
to pull 3934 here, I created a separate entry for the issue: 
https://issues.dlang.org/show_bug.cgi?id=15583.




Re: topN using a heap

2016-01-18 Thread Timon Gehr via Digitalmars-d

On 01/18/2016 01:00 PM, Ivan Kazmenko wrote:


The old version (2.070.0-b2) could not be tricked with it, does it use
random?


Yes, it selected the pivot uniformly at random using the global RNG. 
(This is also why the documentation says topN is O(n) in expectation.)


Re: topN using a heap

2016-01-17 Thread Andrei Alexandrescu via Digitalmars-d

On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:

Here is a more verbose version.


OK, very nice. Thanks! I've modified topN to work as follows. In a loop:

* If nth <= r.length / log2(r.length)^^2 (or is similarly close to 
r.length), use topNHeap with one heap and stop
* If nth <= r.length / log2(r.length) (or is similarly close to 
r.length), use topNHeap with two heaps and stop

* Take a quickselect step, adjust r and nth, and continue

That seems to work nicely for all values involved.

All - let me know how things can be further improved. Thx!


Andrei


Re: topN using a heap

2016-01-17 Thread Ivan Kazmenko via Digitalmars-d
On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu 
wrote:

On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
The average case is O(n + (k log n log k)) for small enough k. 
So, any k
below roughly n / log^2 (n) makes the second summand less than 
the first.


I don't understand how you derived the average case complexity, 
and I don't understand how you derived the bound for k.


Regarding the complexity: for all k (small or large), it takes 
O(k) to build a heap. Then in the worst case we do (n - k) heap 
insertions, i.e. total complexity is O(k + (n - k) * log(k)). 
Is that correct?


Yeah, I have the same notion.

If k is any fraction of n, the worst case complexity comes to 
O(n + n log n) i.e. O(n log n). By your calculation, the 
average complexity comes to O(n log^2 n), i.e. greater than the 
worst case. So there's a mistake somewhere.


The upper bound O(n + k log k log n) is correct, but it's tight 
only for small k.  I'll explain below.



Could you please clarify matters? Thanks!


Here is a more verbose version.

Suppose for simplicity that the array consists of real numbers 
drawn from the same continuous random distribution.
(A discrete random distribution would complicate the analysis a 
bit, since the probability of two elements being equal would 
become greater than zero.)
Then, element a_i is the minimum of a_1, a_2, ..., a_i with 
probability 1/i.
Moreover, it is the second minimum with probability 1/i, the 
third minimum with probability 1/i, ..., the maximum of a_1, a_2, 
..., a_i with the same probability 1/i.
To prove this, we can show that, if we look at the ordering 
permutation p_1, p_2, ..., p_i such that a_{p_1} < a_{p_2} < 
... < a_{p_i}, each of the i! possible permutations is equally 
probable.


What immediately follows is that when we track the k smallest 
elements and consider element a_i with i > k, the probability that 
it will change the k smallest elements encountered so far is k/i.
Summing that for i from k+1 to n, we get the expected number of 
heap insertions:

(k/(k+1)) + (k/(k+2)) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically 
equal to log n, and the second line asymptotically equal to log k.
To summarize, the expected number of heap insertions for an array 
of random reals is k (log n - log k), and the complexity is O(n + 
k log k (log n - log k)).
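The closed form above is easy to check numerically. The small C++ check below compares the exact sum of k/i for i from k+1 to n against the asymptotic value k * (ln n - ln k) derived in the text:

```cpp
#include <cassert>
#include <cmath>

// Expected number of heap insertions: sum over i = k+1 .. n of k/i,
// which (by the harmonic-series argument above) is asymptotically
// k * (ln n - ln k).
double expected_insertions(long n, long k) {
    double sum = 0.0;
    for (long i = k + 1; i <= n; ++i)
        sum += static_cast<double>(k) / static_cast<double>(i);
    return sum;
}
```

For n = 10^6 and k = 10^3, the exact sum and the asymptotic formula agree to well under one percent.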


What I did here for simplicity is just omit the "minus logarithm 
of k" part to get an upper bound.

For small enough k, the bound is tight.
For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) / 
2.
On the other hand, when k=Theta(n), log n - log k = log(n/k) = 
Theta(1).


Ivan Kazmenko.



Re: topN using a heap

2016-01-17 Thread deadalnix via Digitalmars-d
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:

https://github.com/D-Programming-Language/phobos/pull/3934

So, say you're looking for the smallest 10 elements out of 
100_000. The quickselect algorithm (which topN currently uses) 
will successively partition the set into (assuming the pivot 
choice works well) chunks of 50_000, 25_000, etc., all the way down 
to finding the smallest 10.


That's quite a bit of work, so 3934 uses an alternate strategy 
for finding the smallest 10:


1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element 
is found that is smaller than the largest in the heap, swap it 
with the largest in the heap then restore the heap property.


3. At the end, swap the largest in the heap with the 10th and 
you're done!


This is very effective, and is seldom referred to in the selection 
literature. In fact, a more inefficient approach (heapify the 
entire range) is discussed more often.
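In code, the three steps look roughly like the following Python sketch (Python instead of D for brevity; heapq is a min-heap, so values are negated to emulate a max-heap; `top_n_heap` is an illustrative name, not the Phobos function, and it returns a rearranged copy rather than working in place):

```python
import heapq

def top_n_heap(a, nth):
    """Return a rearrangement of a in which index nth holds the
    (nth+1)-th smallest element and everything before it is no larger:
    the topN postcondition.  Sketch only; Phobos works in place."""
    k = nth + 1
    # Step 1: max-heap over the first nth+1 elements (negated for heapq).
    heap = [-x for x in a[:k]]
    heapq.heapify(heap)
    rest = []
    # Step 2: scan; any element below the heap's max displaces it.
    for x in a[k:]:
        if x < -heap[0]:
            evicted = -heapq.heapreplace(heap, -x)
            rest.append(evicted)
        else:
            rest.append(x)
    # Step 3: the heap now holds the k smallest; its maximum belongs at
    # index nth.  (Sorting the prefix here over-delivers -- topN only
    # promises that the element at nth is in place.)
    top = sorted(-v for v in heap)
    return top + rest

# Tiny demo: the 4th smallest of the data ends up at index 3.
data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0, 10, 11]
out = top_n_heap(data, 3)
```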



Destroy!

Andrei


A common way to do it is to go quicksort, but only recurse on one 
side of the set. That should give log(n)^2 complexity on average.


Re: topN using a heap

2016-01-17 Thread deadalnix via Digitalmars-d
On Monday, 18 January 2016 at 01:38:16 UTC, Andrei Alexandrescu 
wrote:

On 1/17/16 8:07 PM, deadalnix wrote:
A common way to do it is to go quicksort, but only recurse on 
one side

of the set. That should give log(n)^2 complexity on average.


Yah, that's quickselect (which this work started from). It's 
linear, and you can't get top n in sublinear time because you 
need to look at all elements. -- Andrei


Yeah; forget about me, I was dumb on that one.


Re: topN using a heap

2016-01-17 Thread Andrei Alexandrescu via Digitalmars-d

On 1/17/16 8:07 PM, deadalnix wrote:

A common way to do it is to go quicksort, but only recurse on one side
of the set. That should give log(n)^2 complexity on average.


Yah, that's quickselect (which this work started from). It's linear, and 
you can't get top n in sublinear time because you need to look at all 
elements. -- Andrei




Re: topN using a heap

2016-01-17 Thread Ivan Kazmenko via Digitalmars-d
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:

On 1/16/16 9:37 PM, Timon Gehr wrote:
Ivan's analysis suggests that even something significantly 
larger, like

n/log(n)² might work as an upper bound for k.


I'm not clear on how you got to that boundary. There are a few 
implementation details to be minded as well (quickly and 
carefully approximating the log etc). Could you please make a 
commented pull request against my own? Thanks! -- Andrei


The average case is O(n + (k log n log k)) for small enough k.  
So, any k below roughly n / log^2 (n) makes the second summand 
less than the first.


Of course, the constants not present in asymptotic analysis, as 
well as handling the worst case, may make this notion 
impractical.  Measuring is the way :) .  The worst case is just a 
decreasing sequence, or almost decreasing if we want to play 
against branch prediction as well.




Re: topN using a heap

2016-01-17 Thread Ivan Kazmenko via Digitalmars-d

On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:

On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:

On 1/16/16 8:00 PM, Timon Gehr wrote:
The implementation falls back to topNHeap whenever k is 
within the first

or last ~n/8 elements and therefore is Ω(n log n) on average.


Correct. The pedantic approach would be to only do that when k 
is up to

log(n). -- Andrei



Ivan's analysis suggests that even something significantly 
larger, like n/log(n)² might work as an upper bound for k.
I don't think that meeting the documented runtime bounds 
amounts to pedantry. (Having linear average running time of 
course does not even imply linear expected running time, as is 
still written in the documentation.)


I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, 
well, stable, but additionally guarantees n log n with a larger 
constant (MergeSort).


And we have SwapStrategy.unstable which uses Introsort, which, in 
turn, has smaller constant on sane inputs.


Here, Andrei's suggestion is to also have two implementations, 
let us call them TopNStrategy.quickSelect and TopNStrategy.heap.


The quickSelect one is O(n) on sane inputs but O(n^2) on an 
artificial worst case.


The heap implementation is also O(n) when k is close to 0 or n, 
but O(n log n) otherwise.  What's important is that it is also 
O(n log n) worst case.


The current proposal switches between strategies based on a 
heuristic (k < n/8 or k > 7n/8 if I understand correctly), which 
may not be the best strategy in all cases.


So, what can be done is to introduce TopNStrategy.auto which is 
the default and uses the (current or better) heuristic to switch 
between strategies, but also leave a way to explicitly select one 
of the two strategies, just like one can do now with sort!(..., 
SwapStrategy.whatWeExplicitlyWant).


All the code for both strategies is already there.  Each strategy 
has its benefits not covered by the other (quickSelect is faster 
for k~n and sane inputs, heap is faster for k close to boundary 
and special inputs).  So, provide the default but let the user 
choose.


Ivan Kazmenko.



Re: topN using a heap

2016-01-17 Thread Andrei Alexandrescu via Digitalmars-d

On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:

The average case is O(n + (k log n log k)) for small enough k. So, any k
below roughly n / log^2 (n) makes the second summand less than the first.


I don't understand how you derived the average case complexity, and I 
don't understand how you derived the bound for k.


Regarding the complexity: for all k (small or large), it takes O(k) to 
build a heap. Then in the worst case we do (n - k) heap insertions, i.e. 
total complexity is O(k + (n - k) * log(k)). Is that correct?


If k is any fraction of n, the worst case complexity comes to O(n + n 
log n) i.e. O(n log n). By your calculation, the average complexity 
comes to O(n log^2 n), i.e. greater than the worst case. So there's a 
mistake somewhere.


Could you please clarify matters? Thanks!


Andrei



Re: topN using a heap

2016-01-17 Thread Andrei Alexandrescu via Digitalmars-d

On 01/17/2016 06:55 AM, Ivan Kazmenko wrote:

So, what can be done is to introduce TopNStrategy.auto which is the
default and uses the (current or better) heuristic to switch between
strategies, but also leave a way to explicitly select one of the two
strategies, just like one can do now with sort!(...,
SwapStrategy.whatWeExplicitlyWant).


A nice idea, but it seems a bit overengineered. The differences among 
approaches are rather subtle and explaining the circumstances under 
which one does better than the other is about as difficult as making the 
choice in the implementation.


BTW there's yet another approach:

1. Create a max heap for r[0 .. nth]

2. Create a min heap for r[nth .. $]

3. As long as r[nth] < r[0], swap them and restore the heap property in 
the two heaps


At the end of the process we have the smallest element in r[nth .. $], 
which is exactly in the r[nth] position, greater than or equal to the 
largest element in r[0 .. nth], which is exactly what the doctor prescribed.


Complexity of this turns rather nice. Step 1 is O(k), step 2 is O(n - 
k). In the worst case (r is sorted descending), step 3 will stop after k 
iterations and does k heap insertions in the two heaps. So overall 
complexity is O(n + k log(k) + k log(n)). Since k <= n, keep the largest 
terms: O(n + k log(n)). So if k is proportional to n / log(n), we get 
O(n). And that's worst case!
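A rough Python rendering of this two-heap scheme (negated values emulate the max-heap; `top_n_two_heaps` is a made-up name for the sketch, which assumes 0 < nth < len(a) and returns a rearranged copy instead of working in place):

```python
import heapq

def top_n_two_heaps(a, nth):
    """Two-heap topN sketch: a max-heap (negated) over the would-be
    first nth elements, a min-heap over the rest; swap the two tops
    while they are out of order.  Illustrative only."""
    lo = [-x for x in a[:nth]]          # Step 1: max-heap, O(nth)
    heapq.heapify(lo)
    hi = a[nth:]                        # Step 2: min-heap, O(n - nth)
    heapq.heapify(hi)
    # Step 3: while the smallest on the right is below the largest on
    # the left, exchange them and restore both heap properties.
    while hi[0] < -lo[0]:
        big, small = -lo[0], hi[0]
        heapq.heapreplace(lo, -small)
        heapq.heapreplace(hi, big)
    # Now max(lo) <= min(hi): min(hi) is exactly the element for index nth.
    return sorted(-v for v in lo) + [heapq.heappop(hi)] + hi

demo = [5, 9, 0, 7, 2, 8, 1, 6, 4, 3]
out = top_n_two_heaps(demo, 4)
```

Each iteration of step 3 moves one misplaced element to its proper side, so the loop runs at most min(nth, n - nth) times, matching the worst-case count in the analysis above.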


BTW I figured how to do stable partition. That'll come in a distinct PR.


Andrei



Re: topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

On 1/16/16 8:00 PM, Timon Gehr wrote:

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:

...


To summarize:
For k<

The implementation falls back to topNHeap whenever k is within the first
or last ~n/8 elements and therefore is Ω(n log n) on average.


Correct. The pedantic approach would be to only do that when k is up to 
log(n). -- Andrei




Re: topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

On 1/16/16 9:12 PM, Andrei Alexandrescu wrote:

On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:

On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
> wrote:


1. Organize the first 11 elements into a max heap


Why not the first 10?


So you get to put the appropriate element in the 11th position. -- Andrei


To clarify this using a degenerate case: say someone calls topN(r, 0) 
i.e. find the minimum. Then you'd need a mini-heap of exactly one 
element to track the smallest element found so far. -- Andrei




Re: topN using a heap

2016-01-16 Thread Ziad Hatahet via Digitalmars-d
On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d@puremagic.com> wrote:

>
> 1. Organize the first 11 elements into a max heap
>
>
Why not the first 10?


Re: topN using a heap

2016-01-16 Thread Ivan Kazmenko via Digitalmars-d
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
That's quite a bit of work, so 3934 uses an alternate strategy 
for finding the smallest 10:


1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element 
is found that is smaller than the largest in the heap, swap it 
with the largest in the heap then restore the heap property.


3. At the end, swap the largest in the heap with the 10th and 
you're done!


This is very effective, and is seldom referred to in the selection 
literature. In fact, a more inefficient approach (heapify the 
entire range) is discussed more often.


Yeah, this sounds nice.  Suppose we have an array of length n and 
want to find the smallest k elements.  The first step (heapify) 
is O(k) operations.


1. Let us analyze the rest for "small enough" k.

1a. If the array is "random", the expected number of insertions 
is the sum of probabilities that each new element is inserted.  
This number is (k/(k+1)) + (k/(k+2)) + ... + k/n, which is k 
times a tail of the harmonic series, that is, on the order of k log 
n.  In each case, O(log k) operations are required to process the 
new element.  In total, the "average" case takes (n-k) + (k log n 
log k), a total of O(n) for small enough values of k.


1b. In the worst case (each element is less than the current 
top), this still requires (n-k) log k operations, so we can say 
the total is O(n log k).


2. For k of the same order as n, the algorithm is O(n log n), 
just as a regular HeapSort.


To summarize:
For k<

Re: topN using a heap

2016-01-16 Thread Ivan Kazmenko via Digitalmars-d
On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
3. At the end, swap the largest in the heap with the 10th and 
you're done!


And why this?  Do we additionally require the k-th element to 
arrive exactly on k-th place?




Re: topN using a heap

2016-01-16 Thread Timon Gehr via Digitalmars-d

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:

...


To summarize:
For k<

The implementation falls back to topNHeap whenever k is within the first 
or last ~n/8 elements and therefore is Ω(n log n) on average.


Re: topN using a heap

2016-01-16 Thread Timon Gehr via Digitalmars-d

On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:

On 1/16/16 8:00 PM, Timon Gehr wrote:

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:

...


To summarize:
For k<

The implementation falls back to topNHeap whenever k is within the first
or last ~n/8 elements and therefore is Ω(n log n) on average.


Correct. The pedantic approach would be to only do that when k is up to
log(n). -- Andrei



Ivan's analysis suggests that even something significantly larger, like 
n/log(n)² might work as an upper bound for k.
I don't think that meeting the documented runtime bounds amounts to 
pedantry. (Having linear average running time of course does not even 
imply linear expected running time, as is still written in the 
documentation.)


Re: topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

On 1/16/16 5:27 PM, Ivan Kazmenko wrote:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:

3. At the end, swap the largest in the heap with the 10th and you're
done!


And why this?  Do we additionally require the k-th element to arrive
exactly on k-th place?


Yes. -- Andrei



Re: topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:

On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
> wrote:


1. Organize the first 11 elements into a max heap


Why not the first 10?


So you get to put the appropriate element in the 11th position. -- Andrei



Re: topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

On 1/16/16 9:37 PM, Timon Gehr wrote:

Ivan's analysis suggests that even something significantly larger, like
n/log(n)² might work as an upper bound for k.


I'm not clear on how you got to that boundary. There are a few 
implementation details to be minded as well (quickly and carefully 
approximating the log etc). Could you please make a commented pull 
request against my own? Thanks! -- Andrei




Re: topN using a heap

2016-01-16 Thread Ilya Yaroshenko via Digitalmars-d
On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:

On 1/16/16 9:37 PM, Timon Gehr wrote:
Ivan's analysis suggests that even something significantly 
larger, like

n/log(n)² might work as an upper bound for k.


I'm not clear on how you got to that boundary. There are a few 
implementation details to be minded as well (quickly and 
carefully approximating the log etc). Could you please make a 
commented pull request against my own? Thanks! -- Andrei


size_t.sizeof * 8 - bsr(n) should be a good approximation for 
sorting. --Ilya
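(bsr, bit scan reverse, returns the index of the most significant set bit, i.e. floor(log2 n); in Python the same integer log is available via bit_length. The sketch below shows only the integer-log trick, not Ilya's exact expression, which subtracts bsr from the word size.)

```python
def ilog2(n):
    """floor(log2(n)) for n > 0 -- the value a hardware bsr(n) yields."""
    return n.bit_length() - 1
```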


topN using a heap

2016-01-16 Thread Andrei Alexandrescu via Digitalmars-d

https://github.com/D-Programming-Language/phobos/pull/3934

So, say you're looking for the smallest 10 elements out of 100_000. The 
quickselect algorithm (which topN currently uses) will successively 
partition the set into (assuming the pivot choice works well) 50_000, 
25_000, etc. chunks all the way down to finding the smallest 10.
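That successive-halving behavior is plain quickselect. A compact Python sketch of it (random pivots, Hoare-style partition; illustrative only, not the Phobos code):

```python
import random

def quickselect(a, nth):
    """In-place quickselect sketch: partition around a random pivot with
    Hoare-style crossing scans, then narrow (iteratively) to the side
    holding index nth.  Average O(n); worst case O(n^2)."""
    lo, hi = 0, len(a)
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        i, j = lo, hi - 1
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # a[lo..j] <= pivot <= a[i..hi); keep only the side holding nth.
        if nth <= j:
            hi = j + 1
        elif nth >= i:
            lo = i
        else:
            return                      # nth fell between the halves
    # hi - lo <= 1: a[nth] is in its final sorted position.

random.seed(42)
data = list(range(50))
random.shuffle(data)
quickselect(data, 20)
```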


That's quite a bit of work, so 3934 uses an alternate strategy for 
finding the smallest 10:


1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element is found 
that is smaller than the largest in the heap, swap it with the largest 
in the heap then restore the heap property.


3. At the end, swap the largest in the heap with the 10th and you're done!

This is very effective, and is seldom referred to in the selection 
literature. In fact, a more inefficient approach (heapify the entire 
range) is discussed more often.



Destroy!

Andrei