Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-02 Thread Robert Haas
On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma atri.j...@gmail.com wrote:
 If you want to get a useful response to your emails, consider
 including a statement of what you think the problem is and why you
 think your proposed changes will help.  Consider offering a test case
 that performs badly and an analysis of the reason why.

 Right, thanks for that. I will keep that in mind.

 I was thinking about *mostly sorted* datasets, consider the following:

 10 11 12 4 5 6 1 2

I think if you'll try it you'll find that we perform quite well on
data sets of this kind - and if you read the code you'll see why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-02 Thread Atri Sharma
On Tue, Jul 2, 2013 at 5:20 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jul 2, 2013 at 12:33 AM, Atri Sharma atri.j...@gmail.com wrote:
 If you want to get a useful response to your emails, consider
 including a statement of what you think the problem is and why you
 think your proposed changes will help.  Consider offering a test case
 that performs badly and an analysis of the reason why.

 Right, thanks for that. I will keep that in mind.

 I was thinking about *mostly sorted* datasets, consider the following:

 10 11 12 4 5 6 1 2

 I think if you'll try it you'll find that we perform quite well on
 data sets of this kind - and if you read the code you'll see why.

Right, let me read the code again from that viewpoint.

Thanks a ton for your help!

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-02 Thread Peter Geoghegan
On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma atri.j...@gmail.com wrote:
 I think if you'll try it you'll find that we perform quite well on
 data sets of this kind - and if you read the code you'll see why.

 Right, let me read the code again from that viewpoint.

In my opinion, it would be worthwhile reading the original Bentley and
McIlroy paper [1] and using what you learned to write a patch that
adds comments throughout the canonical qsort_arg, and perhaps the
other variants.

[1] 
http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf
-- 
Peter Geoghegan




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-02 Thread Claudio Freire
On Tue, Jul 2, 2013 at 12:36 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jul 2, 2013 at 5:04 AM, Atri Sharma atri.j...@gmail.com wrote:
 I think if you'll try it you'll find that we perform quite well on
 data sets of this kind - and if you read the code you'll see why.

 Right, let me read the code again from that viewpoint.

 In my opinion, it would be worthwhile reading the original Bentley and
 McIlroy paper [1] and using what you learned to write a patch that
 adds comments throughout the canonical qsort_arg, and perhaps the
 other variants.

 [1] 
 http://www.enseignement.polytechnique.fr/informatique/profs/Luc.Maranget/421/09/bentley93engineering.pdf

That's weird, it doesn't seem as sophisticated as even libc's introsort.

Perhaps an introsort[1] approach wouldn't hurt: do the quick-and-dirty
median selection pg is already doing (or a better one, if one is
found), but track the recursion depth relative to the input size.

When the input set hasn't been halved across K recursive calls, switch
to median of medians to guard against quadratic complexity.

[1] http://en.wikipedia.org/wiki/Selection_algorithm#Introselect
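The guard described above might look roughly like this (a Python sketch with illustrative names, not PostgreSQL's actual qsort_arg): recurse with a cheap pivot while a depth budget lasts, then fall back to a median-of-medians pivot.

```python
def median_of_medians(a):
    # Deterministic pivot: median of the medians of groups of five
    # (BFPRT).  Guarantees a pivot within the middle ~40% of the data.
    if len(a) <= 5:
        return sorted(a)[len(a) // 2]
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    return median_of_medians(medians)

def guarded_quicksort(a, depth_limit=None):
    # Depth budget of ~2*log2(n), as in introsort; exceeding it means
    # the cheap pivots are not halving the input, so switch strategies.
    if depth_limit is None:
        depth_limit = 2 * max(len(a), 1).bit_length()
    if len(a) <= 1:
        return list(a)
    if depth_limit > 0:
        pivot = a[len(a) // 2]        # quick-and-dirty pivot (fast path)
    else:
        pivot = median_of_medians(a)  # guard against quadratic behaviour
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    return (guarded_quicksort(smaller, depth_limit - 1) + equal +
            guarded_quicksort(larger, depth_limit - 1))
```

The fallback never fires on well-behaved inputs, so the common-case cost stays that of the cheap pivot.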




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread jasmine
My PostgreSQL (9.2) is crashing after certain load tests. Currently,
PostgreSQL crashes when 800 to 1000 threads are run simultaneously on a
schema with 10 million records. Not sure if we have to tweak some more
parameters of Postgres.



-
jasmine
--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Randomisation-for-ensuring-nlogn-complexity-in-quicksort-tp5761907p5762011.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread David Fetter
On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
 My PostgreSQL (9.2) is crashing after certain load tests. Currently,
 PostgreSQL crashes when 800 to 1000 threads are run simultaneously on a
 schema with 10 million records. Not sure if we have to tweak some more
 parameters of Postgres.

Jasmine,

Please start your own thread rather than replying to some unrelated
thread.

In the particular case above, please also provide a reproducible test
case, including the exact version of PostgreSQL to which it applies
(e.g., 9.2.4), along with its output, so other people can see what
you're talking about and can actually help.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Atri Sharma
On Mon, Jul 1, 2013 at 10:02 PM, David Fetter da...@fetter.org wrote:
 On Mon, Jul 01, 2013 at 04:42:04AM -0700, jasmine wrote:
 My PostgreSQL (9.2) is crashing after certain load tests. Currently,
 PostgreSQL crashes when 800 to 1000 threads are run simultaneously on a
 schema with 10 million records. Not sure if we have to tweak some more
 parameters of Postgres.

 Jasmine,

 Please start your own thread rather than replying to some unrelated
 thread.

 In the particular case above, please also provide a reproducible test
 case, including the exact version of PostgreSQL to which it applies
 (e.g., 9.2.4), along with its output, so other people can see what
 you're talking about and can actually help.

Not sure whether this belongs on -general rather than -hackers, though.

Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Robert Haas
On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma atri.j...@gmail.com wrote:
 I have been reading the recent discussion and was researching a bit, and I
 think that we should really go with the idea of randomising the input data (if
 it is not completely presorted), to ensure that we do not get quadratic
 complexity.

That doesn't ensure any such thing.  It just makes it less likely.
But we have existing guards that also make that unlikely, so I'm not
sure what we'd be gaining.

 One easy way to do that could be to take a sample of the data set, and take a
 pivot out of it. Still a better way could be to take multiple samples which
 are spread over the data set, select a value from each of them, and then take
 a cumulative pivot (median, maybe).

We pretty much do that already.

 This shouldn't be too complex, and should give us a fixed nlogn complexity 
 even for wild data sets, without affecting existing normal data sets that are 
 present in every day transactions. I even believe that those data sets will 
 also benefit from the above optimisation.

The only method of selecting a pivot for quicksort that obtains O(n lg
n) run time with 100% certainty is to have a magical oracle inside the
computer that tells you, in fixed time and with perfect accuracy, which
pivot you should select.

If you want to get a useful response to your emails, consider
including a statement of what you think the problem is and why you
think your proposed changes will help.  Consider offering a test case
that performs badly and an analysis of the reason why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Claudio Freire
On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas robertmh...@gmail.com wrote:
 This shouldn't be too complex, and should give us a fixed nlogn complexity 
 even for wild data sets, without affecting existing normal data sets that 
 are present in every day transactions. I even believe that those data sets 
 will also benefit from the above optimisation.

 The only method of selecting a pivot for quicksort that obtains O(n lg
 n) run time with 100% certainty is to have a magical oracle inside the
 computer that tells you, in fixed time and with perfect accuracy, which
 pivot you should select.


Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?

Granted, with a huge constant (I think 4)... but it should still be O(n lg n).




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Robert Haas
On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas robertmh...@gmail.com wrote:
 This shouldn't be too complex, and should give us a fixed nlogn complexity 
 even for wild data sets, without affecting existing normal data sets that 
 are present in every day transactions. I even believe that those data sets 
 will also benefit from the above optimisation.

 The only method of selecting a pivot for quicksort that obtains O(n lg
 n) run time with 100% certainty is to have a magical oracle inside the
 computer that tells you, in fixed time and with perfect accuracy, which
 pivot you should select.

 Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?

 Granted, with a huge constant (I think 4)... but it should still be O(n lg n).

No.  Thinking about this a little more, I believe the way it works
out is that any algorithm for picking the median that guarantees that
a certain *percentage* of the tuples will be in the smaller partition
will have O(n lg n) complexity, but any algorithm that only guarantees
that a fixed *number* of tuples will be in the smaller partition is
still quadratic in complexity.  In the case of a median algorithm,
you're only guaranteed to have 2 elements in the smaller partition,
which is a constant.  If you take a median of medians, you're
guaranteed to have 8 elements in the smaller partition, which is
bigger, but still a constant.

The reason why this doesn't matter much in practice is because the
data distribution that causes quadratic behavior for median-of-medians
is not one which is likely to occur in the real world and will
probably only come up if chosen by an adversary who is attempting to
make your life miserable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Claudio Freire
On Mon, Jul 1, 2013 at 5:12 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jul 1, 2013 at 3:54 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Mon, Jul 1, 2013 at 4:32 PM, Robert Haas robertmh...@gmail.com wrote:
 This shouldn't be too complex, and should give us a fixed nlogn complexity 
 even for wild data sets, without affecting existing normal data sets that 
 are present in every day transactions. I even believe that those data sets 
 will also benefit from the above optimisation.

 The only method of selecting a pivot for quicksort that obtain O(n lg
 n) run time with 100% certainty is have a magical oracle inside the
 computer that tells you in fixed time and with perfect accuracy which
 pivot you should select.

 Doesn't a linear median algorithm (say median of medians) get you O(n lg n)?

 Granted, with a huge constant (I think 4)... but it should still be O(n lg 
 n).

 No.  Thinking about this a little more, I believe the way it works
 out is that any algorithm for picking the median that guarantees that
 a certain *percentage* of the tuples will be in the smaller partition
 will have O(n lg n) complexity, but any algorithm that only guarantees
 that a fixed *number* of tuples will be in the smaller partition is
 still quadratic in complexity.  In the case of a median algorithm,
 you're only guaranteed to have 2 elements in the smaller partition,
 which is a constant.  If you take a median of medians, you're
 guaranteed to have 8 elements in the smaller partition, which is
 bigger, but still a constant.


What?

A median of medians algorithm will guarantee floor(N/2) elements on
the smaller side. That's the definition of the median.

Note that I'm referring to picking the actual median of all tuples,
not just a sample. That's slow, but it guarantees O(n log n).
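A toy illustration of why the exact median gives that bound (an assumed sketch, not pg's implementation; `statistics.median_low` stands in for a true linear-time selection, since it sorts internally): each partition holds at most about half the input, so the recursion depth is logarithmic.

```python
import statistics

def quicksort_exact_median(a):
    # Pivoting on the true median leaves at most ceil(n/2) elements on
    # either side, so the recursion depth is O(log n) and the total work
    # O(n log n) -- guaranteed, but with a large constant factor.
    if len(a) <= 1:
        return list(a)
    pivot = statistics.median_low(a)
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    return (quicksort_exact_median(smaller) + equal +
            quicksort_exact_median(larger))
```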




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Peter Geoghegan
On Mon, Jul 1, 2013 at 12:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I have been reading the recent discussion and was researching a bit, and I
 think that we should really go with the idea of randomising the input
 data (if it is not completely presorted), to ensure that we do not get
 quadratic complexity.

 That doesn't ensure any such thing.  It just makes it less likely.
 But we have existing guards that also make that unlikely, so I'm not
 sure what we'd be gaining.

+1

-- 
Peter Geoghegan




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Robert Haas
On Mon, Jul 1, 2013 at 4:31 PM, Claudio Freire klaussfre...@gmail.com wrote:
 What?

 A median of medians algorithm will guarantee floor(N/2) elements on
 the smaller side. That's the definition of the median.

 Note that I'm referring to picking the actual median of all tuples,
 not just a sample. That's slow, but it guarantees O(n log n).

Ah, OK.  Well, yes, that would guarantee O(n lg n).  But, as you say,
it would be slow.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-07-01 Thread Atri Sharma
On Tue, Jul 2, 2013 at 1:02 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Jun 30, 2013 at 8:30 AM, Atri Sharma atri.j...@gmail.com wrote:
 I have been reading the recent discussion and was researching a bit, and I
 think that we should really go with the idea of randomising the input
 data (if it is not completely presorted), to ensure that we do not get
 quadratic complexity.

 That doesn't ensure any such thing.  It just makes it less likely.
 But we have existing guards that also make that unlikely, so I'm not
 sure what we'd be gaining.

 One easy way to do that could be to take a sample of the data set, and take
 a pivot out of it. Still a better way could be to take multiple samples
 which are spread over the data set, select a value from each of them, and
 then take a cumulative pivot (median, maybe).

 We pretty much do that already.

 This shouldn't be too complex, and should give us a fixed nlogn complexity 
 even for wild data sets, without affecting existing normal data sets that 
 are present in every day transactions. I even believe that those data sets 
 will also benefit from the above optimisation.

 The only method of selecting a pivot for quicksort that obtains O(n lg
 n) run time with 100% certainty is to have a magical oracle inside the
 computer that tells you, in fixed time and with perfect accuracy, which
 pivot you should select.

 If you want to get a useful response to your emails, consider
 including a statement of what you think the problem is and why you
 think your proposed changes will help.  Consider offering a test case
 that performs badly and an analysis of the reason why.

Right, thanks for that. I will keep that in mind.

I was thinking about *mostly sorted* datasets; consider the following:

10 11 12 4 5 6 1 2

(Just off my head, sorry if I missed something).

Now, the above data set is made up of rotations of a sorted dataset,
so it is mostly sorted, albeit with some disordering.

My point is that these kinds of datasets (not necessarily the above
one) can lead to a bad choice of pivot, and hence give us complexity
worse than N log N.

I know we have a check for presorted inputs, but I wasn't sure how we
deal with mostly sorted inputs, as quicksort likes disorder in its
input.

I agree with Claudio's idea. One thing to keep in mind is that we
don't use quicksort for large data sets anyway, and move to external
merge sort for those. So we could think of using the median-of-medians
algorithm for the purpose.

Another thing I would like to investigate is the performance of our
quicksort implementation (and maybe of external merge sort as well) on
multiword keys.

Regards,

Atri


--
Regards,

Atri
l'apprenant




[HACKERS] Randomisation for ensuring nlogn complexity in quicksort

2013-06-30 Thread Atri Sharma
Hi all,

I have been reading the recent discussion and was researching a bit, and I
think that we should really go with the idea of randomising the input data (if
it is not completely presorted), to ensure that we do not get quadratic
complexity.

One easy way to do that could be to take a sample of the data set, and take a
pivot out of it. Still a better way could be to take multiple samples which are
spread over the data set, select a value from each of them, and then take a
cumulative pivot (median, maybe).
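For reference, a sampling scheme along these lines is close to the Bentley/McIlroy pseudo-median of nine ("ninther") that PostgreSQL's quicksort already applies to large inputs. A rough Python sketch of that pivot choice (illustrative helper names, not the actual C code):

```python
def med3(a, i, j, k):
    # Index of the median of a[i], a[j], a[k].
    x, y, z = a[i], a[j], a[k]
    if x < y:
        return j if y < z else (k if x < z else i)
    return i if x < z else (k if y < z else j)

def pseudo_median_of_nine(a):
    # Bentley/McIlroy pivot selection: middle element for tiny inputs,
    # median of three for small ones, and the "ninther" (median of the
    # medians of three spread-out triples) for large ones.
    n = len(a)
    lo, mid, hi = 0, n // 2, n - 1
    if n < 7:
        return mid
    if n < 40:
        return med3(a, lo, mid, hi)
    s = n // 8
    i = med3(a, lo, lo + s, lo + 2 * s)
    j = med3(a, mid - s, mid, mid + s)
    k = med3(a, hi - 2 * s, hi - s, hi)
    return med3(a, i, j, k)
```

Because the nine probes are spread across the array, a single run of out-of-place values (as in a rotated input) is unlikely to dominate the sample.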

Anyway, I really think that if we do not go with the above ideas, then we
should somehow factor in the degree of randomness of the input data when
making the decision between quicksort and external merge sort for a set of rows.

This shouldn't be too complex, and should give us a fixed n log n complexity even
for wild data sets, without affecting existing normal data sets that are
present in everyday transactions. I even believe that those data sets will
also benefit from the above optimisation.

Thoughts/Comments?

Regards,
Atri

Sent from my iPad
