Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Bryce Cutt
Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is only enabled on a join when the outer (probe) side is a table scan (seq scan only actually). See getMostCommonValues (soon to be called

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Robert Haas
On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt pandas...@gmail.com wrote: Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is only enabled on a join when the outer (probe) side is a table

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Joshua Tolley
On Tue, Dec 23, 2008 at 09:22:27AM -0500, Robert Haas wrote: On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt pandas...@gmail.com wrote: Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Robert Haas
It's equivalent to our assumption that distributions of values in columns in the same table are independent. Making that assumption in this case would probably result in occasional dramatic speed improvements similar to the ones we've seen in less complex joins, offset by just-as-occasional

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Joshua Tolley
On Tue, Dec 23, 2008 at 10:14:29AM -0500, Robert Haas wrote: It's equivalent to our assumption that distributions of values in columns in the same table are independent. Making that assumption in this case would probably result in occasional dramatic speed improvements similar to the ones

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-22 Thread Joshua Tolley
On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote: [Some performance testing.] I (finally!) have a chance to post my performance testing results... my apologies for the really long delay. Excuses omitted Unfortunately I'm not seeing wonderful speedups with the particular queries I did

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-21 Thread Robert Haas
[Some performance testing.] I ran this query 10x with this patch applied, and then 10x again with enable_hashjoin_usestatmcvs set to false to disable the optimization: select sum(1) from (select * from part, lineitem where p_partkey = l_partkey) x; With the optimization enabled, the query took

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-20 Thread Bryce Cutt
Robert, I thoroughly appreciate the constructive criticism. The compile errors are due to my development process being convoluted. I will endeavor to not waste your time in the future with errors caused by my development process. I have updated the code to follow the conventions and

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-17 Thread Robert Haas
Dr. Lawrence: I'm still working on reviewing this patch. I've managed to load the sample TPCH data from tpch1g1z.zip after changing the line endings to UNIX-style and chopping off the trailing vertical bars. (If anyone is interested, I have the results of pg_dump | bzip2 -9 on the resulting

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-17 Thread Lawrence, Ramon
Robert, You do not need to use qgen.exe to generate queries as you are not running the TPC-H benchmark test. Attached is an example of the 22 sample TPC-H queries according to the benchmark. We have not tested using the TPC-H queries for this particular patch and only use the TPC-H database

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-15 Thread Robert Haas
I have to admit that I haven't fully grokked what this patch is about just yet, so what follows is mostly a coding style review at this point. It would help a lot if you could add some comments to the new functions that are being added to explain the purpose of each at a very high level. There's

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-24 Thread Lawrence, Ramon
Tom Lane wrote: I'm a tad worried about what happens when the values that are frequently occurring in the outer relation are also frequently occurring in the inner (which hardly seems an improbable case). Don't you stand a severe risk of

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-20 Thread Tom Lane
Lawrence, Ramon writes: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. ... The basic idea is to keep build relation tuples in a small in-memory hash table that have join values that are

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-10 Thread Joshua Tolley
On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote: The error is caused by me Asserting against the wrong variable. I never noticed this as I apparently did not have assertions turned on on my development machine. That is fixed now and with the new patch version I have attached all

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Wed, Nov 5, 2008 at 5:06 PM, Bryce Cutt wrote: The error is caused by me Asserting against the wrong variable. I never noticed this as I apparently did not have assertions turned on on my development machine. That is fixed now and with the new patch version I have

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Simon Riggs
On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote: Stay tuned. Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't we have a dust off to see which one is best? Or at least a discussion to

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Thu, Nov 6, 2008 at 3:52 PM, Simon Riggs wrote: On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote: Stay tuned. Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Lawrence, Ramon
Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't we have a dust off to see which one is best? Or at least a discussion to test whether they overlap? Perhaps

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Thu, Nov 6, 2008 at 5:31 PM, Lawrence, Ramon wrote: Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't we have a dust off to see which one

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. I'm running into problems with this patch. It applies cleanly, and the technique you provided for

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. I also recommend modifying doc/src/sgml/config.sgml to include the enable_hashjoin_usestatmcvs

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Tom Lane
Joshua Tolley writes: On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. I also recommend modifying doc/src/sgml/config.sgml to

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Wed, Nov 5, 2008 at 8:20 AM, Tom Lane wrote: Joshua Tolley writes: On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Bryce Cutt
The error is caused by me Asserting against the wrong variable. I never noticed this as I apparently did not have assertions turned on on my development machine. That is fixed now and with the new patch version I have attached all assertions are passing with your query and my test queries. I

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote: The error is caused by me Asserting against the wrong variable. I never noticed this as I apparently did not have assertions turned on on my development machine. That is fixed now and with the new patch version I have attached all

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
Joshua, Thank you for offering to review the patch. The easiest way to test would be to generate your own TPC-H data and load it into a database for testing. I have posted the TPC-H generator at: http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip The generator can produce skewed data sets. It was

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Joshua Tolley
On Sun, Nov 2, 2008 at 4:48 PM, Lawrence, Ramon wrote: Joshua, Thank you for offering to review the patch. The easiest way to test would be to generate your own TPC-H data and load it into a database for testing. I have posted the TPC-H generator at:

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Tom Lane
Lawrence, Ramon writes: The easiest way to test would be to generate your own TPC-H data and load it into a database for testing. I have posted the TPC-H generator at: http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip The generator can produce skewed data sets. It was produced

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
Tom Lane wrote: What alternatives are there for people who do not run Windows? regards, tom lane The TPC-H generator is a standard code base provided at http://www.tpc.org/tpch/. We have been able to compile this code on Linux. However, we

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-01 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. Project name: Histojoin Patch file: histojoin_v1.patch This patch implements the Histojoin

[HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-10-20 Thread Lawrence, Ramon
We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. Project name: Histojoin Patch file: histojoin_v1.patch This patch implements the Histojoin join algorithm as an optional feature added to the standard Hybrid Hash
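
For readers skimming the thread, the idea under discussion can be illustrated with a toy model. The following is only a sketch in Python, not the patch's C code: it assumes that the "skew" values are the most common join values on the probe side, that build tuples matching those values stay in a small in-memory table, and that everything else is split into batches by hash. The function name skew_aware_hash_join, its parameters, and the way frequent values are found (by counting the probe side directly rather than reading stored statistics) are all hypothetical.

```python
from collections import Counter, defaultdict


def skew_aware_hash_join(build, probe, key_build, key_probe,
                         num_mcvs=2, num_batches=4):
    """Toy multi-batch hash join that keeps hot build tuples in memory.

    Returns (joined_pairs, deferred_probe_count). The second value counts
    probe tuples that had to wait for a later batch, standing in for the
    I/O a real multi-batch join would spend on them.
    """
    # Assumption: the most common probe values are found by counting here;
    # a real planner would consult precomputed column statistics instead.
    counts = Counter(key_probe(t) for t in probe)
    mcvs = {k for k, _ in counts.most_common(num_mcvs)}

    # Build phase: tuples with frequent keys go to the in-memory skew
    # table, all other tuples are partitioned into batches by hash.
    skew_table = defaultdict(list)
    build_batches = [[] for _ in range(num_batches)]
    for t in build:
        k = key_build(t)
        if k in mcvs:
            skew_table[k].append(t)
        else:
            build_batches[hash(k) % num_batches].append(t)

    # Probe phase: frequent keys are joined immediately, the rest are
    # deferred to the batch that holds their potential matches.
    results = []
    probe_batches = [[] for _ in range(num_batches)]
    deferred = 0
    for t in probe:
        k = key_probe(t)
        if k in mcvs:
            results.extend((b, t) for b in skew_table.get(k, ()))
        else:
            probe_batches[hash(k) % num_batches].append(t)
            deferred += 1

    # Remaining batches are joined one at a time, as in an ordinary
    # multi-batch hash join.
    for b_batch, p_batch in zip(build_batches, probe_batches):
        table = defaultdict(list)
        for b in b_batch:
            table[key_build(b)].append(b)
        for p in p_batch:
            results.extend((b, p) for b in table.get(key_probe(p), ()))
    return results, deferred
```

With a heavily skewed probe side, most probe tuples hit the in-memory table and never reach a later batch, which is where any saving would come from in this model. It says nothing about the case raised in the thread where the same values are also frequent on the build side.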