Re: [PATCHES] hash index improving v3
Can we consider this hash thread closed/completed?

---------------------------------------------------------------------------

Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > Thinks: Why not just sort all of the time and skip the debate
> > entirely?
>
> The sort is demonstrably a loser for smaller indexes. Admittedly, if
> the index is small then the sort can't cost all that much, but if the
> (correct) threshold is some large fraction of shared_buffers then it
> could still take awhile on installations with lots-o-buffers.
>
> The other side of that coin is that it's not clear this is really
> worth arguing about, much less exposing a separate parameter for.
>
> regards, tom lane

-- 
Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
EnterpriseDB                              http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Wed, 2008-09-24 at 12:04 -0400, Bruce Momjian wrote:
> Can we consider this hash thread closed/completed?

Please

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [PATCHES] hash index improving v3
Simon Riggs <[EMAIL PROTECTED]> writes:
> maintenance_work_mem is already used for 3 separate operations that
> bear little resemblance to each other. If it's appropriate for all of
> those then it's appropriate for this usage also.

No, it isn't. The fundamental point here is that this isn't a memory
allocation parameter; it's a switchover threshold between two
different behaviors.

regards, tom lane
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 08:16 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > maintenance_work_mem is already used for 3 separate operations that
> > bear little resemblance to each other. If it's appropriate for all
> > of those then it's appropriate for this usage also.
>
> No, it isn't. The fundamental point here is that this isn't a memory
> allocation parameter; it's a switchover threshold between two
> different behaviors.

That's a little confusing, since sorts switch their behaviour also,
but use (some form of) work_mem, which is *also* their max allocation.
I see the difficulty in understanding the algorithm's behaviour now.
So shared_buffers is the wrong parameter, but even if we had a
parameter it would be very difficult to set it.

Thinks: Why not just sort all of the time and skip the debate
entirely?

I thought the main use case was for larger indexes, since that's when
the number of levels in the index is significantly less than btrees?
Do we need to optimise creation time of smaller hash indexes at all?

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [PATCHES] hash index improving v3
Simon Riggs <[EMAIL PROTECTED]> writes:
> Thinks: Why not just sort all of the time and skip the debate
> entirely?

The sort is demonstrably a loser for smaller indexes. Admittedly, if
the index is small then the sort can't cost all that much, but if the
(correct) threshold is some large fraction of shared_buffers then it
could still take awhile on installations with lots-o-buffers.

The other side of that coin is that it's not clear this is really
worth arguing about, much less exposing a separate parameter for.

regards, tom lane
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote:
> The other side of that coin is that it's not clear this is really
> worth arguing about, much less exposing a separate parameter for.

Agreed.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [PATCHES] hash index improving v3
On Fri, Sep 12, 2008 at 8:29 AM, Kenneth Marshall <[EMAIL PROTECTED]> wrote:
> On Thu, Sep 11, 2008 at 08:51:53PM -0600, Alex Hunsaker wrote:
> > On Thu, Sep 11, 2008 at 9:24 AM, Kenneth Marshall wrote:
> > > Alex, I meant to check the performance with increasing numbers of
> > > collisions, not increasing size of the hashed item. In other
> > > words, something like this:
> > >
> > > for ($coll = 500; $coll <= $max_coll; $coll = $coll * 2) {
> > >   for ($i = 0; $i <= $max_coll; $i++) {
> > >     hash(int8 $i);
> > >   }
> > >   # add the appropriate number of collisions, distributed evenly
> > >   # to minimize the packing overrun problem
> > >   for ($dup = 0; $dup <= $coll; $dup++) {
> > >     hash(int8 MAX_INT + $dup * $max_coll/$coll);
> > >   }
> > > }
> > >
> > > Ken
> >
> > *doh* right something like this...
> >
> > create or replace function create_test_hash() returns bool as $$
> > declare
> >     coll integer default 500;
> >     -- tweak this to where create index gets really slow
> >     max_coll integer default 100;
> > begin
> >     loop
> >         execute 'create table test_hash_'|| coll ||' (num int8);';
> >         execute 'insert into test_hash_'|| coll ||' (num)
> >             select n from generate_series(0, '|| max_coll ||') as n;';
> >         execute 'insert into test_hash_'|| coll ||' (num)
> >             select (n+4294967296) * '|| max_coll ||'/'|| coll ||'::int
> >             from generate_series(0, '|| coll ||') as n;';
> >         coll := coll * 2;
> >         exit when coll >= max_coll;
> >     end loop;
> >     return true;
> > end;
> > $$ language 'plpgsql';
> >
> > And then benchmark each table, and for extra credit cluster the
> > table on the index and benchmark that. Also obviously with the
> > hashint8 which just ignores the top 32 bits. Right?
>
> Yes, that is exactly right.
>
> Ken

Ok, I finally found time to do this. In summary, it looks like v5
scales about the same as CVS HEAD when the collisions are spread
evenly (obviously not HEAD with the hash patch applied...). I couldn't
test cluster because we can't cluster on hash indexes...

benchmark with 50,000,000 rows and 500 collisions:

index creation time:
head: 326116.647 ms
v5:   269509.026 ms

pgbench -n -c1 -T600 -f q.sql hash
head: tps = 3226.605611
v5:   tps = 3412.64
(excluding connections establishing)

50,000,000 rows and 32,768,000 collisions:

index time:
head: 576600.967 ms
v5:   487949.490 ms

pgbench -n -c1 -T500 -f q.sql hash
head: tps = 3105.270185
v5:   tps = 3382.25782

You can see each result from 500 all the way up to 32,768,000
collisions in the attached result.out

Attached files:
create_test_hash.sql: function I used to create the test tables
result.out: output from bench.pl, which shows the pgbench results and
the create index times
bench.pl: stupid little perl script to pgbench each of the tables
created by create_test_hash.sql
Re: [PATCHES] hash index improving v3
On Sun, Sep 14, 2008 at 8:16 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> BTW, one thing I noticed was that the hash index build time for the
> wide column case got a lot worse after applying the patch (from 56 to
> 237 sec). The reason for this turned out to be that with the smaller
> predicted index size, the code decided not to use the pre-sorting
> method that was recently added. Reducing effective_cache_size to less
> than the index size brought the time back down, to about 54 sec. So
> it would seem that effective_cache_size is too large a cutoff value.
> I'm considering changing hashbuild to switch over at shared_buffers
> instead of effective_cache_size --- any thoughts about that?

Switching to shared_buffers gets my vote; on my test table with
50,000,000 rows it takes about 8 minutes to create an index using the
default effective_cache_size. With effective_cache_size set to 6GB
(machine has 8GB) it's still going an hour later.
Re: [PATCHES] hash index improving v3
On Mon, Sep 22, 2008 at 7:57 PM, Alex Hunsaker <[EMAIL PROTECTED]> wrote:
> 50,000,000 rows and 32,768,000 collisions

I should mention that's 50 million rows + 32 million more, or
62,768,000 rows, which explains some of the added index creation
time...

> index time:
> head: 576600.967 ms
> v5:   487949.490 ms
Re: [PATCHES] hash index improving v3
Jonah H. Harris <[EMAIL PROTECTED]> writes:
> On Mon, Sep 22, 2008 at 11:25 PM, Alex Hunsaker wrote:
> > On Sun, Sep 14, 2008 at 8:16 PM, Tom Lane wrote:
> > > I'm considering changing hashbuild to switch over at
> > > shared_buffers instead of effective_cache_size --- any thoughts
> > > about that?
> > Switching to shared_buffers gets my vote; on my test table with
> > 50,000,000 rows it takes about 8 minutes to create an index using
> > the default effective_cache_size. With effective_cache_size set to
> > 6GB (machine has 8GB) it's still going an hour later.
> Agreed. I think using shared_buffers as a cutoff is a much better
> idea as well.

Already done in CVS a week or so back, but thanks for following up
with some confirmation.

regards, tom lane
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 00:05 -0400, Tom Lane wrote:
> > > > I'm considering changing hashbuild to switch over at
> > > > shared_buffers instead of effective_cache_size --- any thoughts
> > > > about that?
>
> Already done in CVS a week or so back, but thanks for following up
> with some confirmation.

I think nobody noticed that change, or the discussion. I wasn't very
happy with effective_cache_size, and I'm not happy with shared_buffers
either.

If building hash indexes is memory critical then we just need to say
so and encourage others to set memory use correctly. People are
already aware that maintenance_work_mem needs to be increased for
large index builds, and we will confuse people if we ignore that and
use another parameter instead.

At the very least, a controllable parameter is the only appropriate
choice for production systems, not one that cannot be changed without
restart. It would also throw away any chance of having pg_restore of
hash indexes run in parallel, since it will not be able to control
memory usage.

Setting maintenance_work_mem higher than shared_buffers is easily
possible now, and we should just use that rather than try to prejudge
how people will and can configure their systems. If the
maintenance_work_mem default is too low, let's increase it.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [PATCHES] hash index improving v3
Simon Riggs <[EMAIL PROTECTED]> writes:
> I wasn't very happy with effective_cache_size and not happy with
> shared_buffers either. If building hash indexes is memory critical
> then we just need to say so and encourage others to set memory use
> correctly. People are already aware that maintenance_work_mem needs
> to be increased for large index builds and we will confuse people if
> we ignore that and use another parameter instead.

I think you've got this completely backwards. The general theory about
maintenance_work_mem is "set it as large as you can stand it". The
issue at hand here is that the crossover point for hash index sort
building seems to be a good deal less than all-the-memory-you-have.

Perhaps there is a case for giving this behavior its very own
configuration parameter; but seeing that we still don't have all that
much of a use case for hash indexes at all, I don't feel a need to do
that yet. In any case, tying it to maintenance_work_mem is certainly
wrong.

regards, tom lane
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 00:48 -0400, Tom Lane wrote:
> I think you've got this completely backwards. The general theory
> about maintenance_work_mem is "set it as large as you can stand it".
> The issue at hand here is that the crossover point for hash index
> sort building seems to be a good deal less than
> all-the-memory-you-have.

There's an optimal point for btree builds using sorts that is a good
deal less also, so I don't get that. Plus, if you give it
all-the-memory-you-have there won't be anything left for anything
else. You might set it that high for an empty database load, but you
don't set it that high in production unless you want to see swapping
when you create large indexes.

maintenance_work_mem is already used for 3 separate operations that
bear little resemblance to each other. If it's appropriate for all of
those then it's appropriate for this usage also.

Doing it otherwise is going to mean more than 50% of people do the
wrong thing, which would be a shame because what's been done looks
very cool.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [PATCHES] hash index improving v3
I did some testing of my own on the hash index patch, and mostly seem
to have confirmed Alex's results. I used a table like this:

create table tab (f1 serial primary key, f2 some-datatype);
create index ind on tab using hash (f2);

and populated it with 2 million rows of data; then timed queries like
this:

select * from tab a join tab b using (f2)
  where f2 = (select f2 from tab c
              where c.f1 = (select int4(random() * 2e6)));

using pgbench like this:

pgbench -n -c 1 -T 60 -M prepared -f query.sql hashdb

To test wide indexed columns I used a text column with entries of 100
random characters (identical to Alex's test). I saw a performance gain
of about 50% with an empty kernel cache (26.9 vs 41.9 tps), dropping
to about 14% once the table and index were fully swapped in (4185 vs
4764 tps). This was in a C-locale database. Presumably the win would
have been significantly greater in a non-C-locale test, but I didn't
try it.

To test narrow indexed columns I made f2 a bigint containing 200
consecutive integers. Here I saw about a 5% improvement with either
empty cache (48.1 vs 50.5 tps) or full cache (4590 vs 4800 tps). This
is not too surprising since about the same amount of I/O is needed
either way, and bigint comparison is very cheap. (This is a 64-bit
machine, and it's defaulting to pass-by-value for bigints, so value
comparisons are hardly more expensive than hashcode comparisons.) But
the patch still wins a bit by being able to do binary search within
index pages.

In both of the above cases there were only a negligible number of hash
collisions. I made up another test case, still 2 million bigints, but
with values chosen so that almost every entry had a hash collision
with another entry (a small fraction had two or even 3 collisions).
This showed about a 25% slowdown compared to CVS HEAD with empty cache
(49.9 tps down to 37.2), decreasing to 3% with full cache (4609 vs
4482 tps).

I experimented with some variant queries that did more hash index
fetches per query, and saw penalties as high as 50%. However, this is
surely by far the worst case for the patch: no savings in index size,
and a ridiculously high collision rate.

Lastly, for comparison's sake I tried the wide column case with a
btree instead of hash index. This gave me 31.5 tps with empty cache,
4749 tps with full cache. Note that the btree is losing significantly
to the patched hash index in empty-cache conditions --- this
presumably reflects the much larger index size causing more I/O.

I'm thinking that we should go ahead and apply the patch. AFAIR this
is the first example we've ever seen of hash beating btree for fetch
performance, and it's pretty obvious that removing the indexed value
from the index contents is the reason for the win. So I've lost
interest in the alternative that involved storing both hashcode and
indexed value. We now see a possible production use-case for hash,
namely indexing wide column values.

BTW, one thing I noticed was that the hash index build time for the
wide column case got a lot worse after applying the patch (from 56 to
237 sec). The reason for this turned out to be that with the smaller
predicted index size, the code decided not to use the pre-sorting
method that was recently added. Reducing effective_cache_size to less
than the index size brought the time back down, to about 54 sec. So it
would seem that effective_cache_size is too large a cutoff value. I'm
considering changing hashbuild to switch over at shared_buffers
instead of effective_cache_size --- any thoughts about that?

regards, tom lane
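The collision-heavy test case above relies on a property discussed
elsewhere in the thread: the unmodified hashint8 effectively ignores
the top 32 bits of an int8 value, so n and n + 2^32 collide. A minimal
sketch of that construction, assuming a toy hash that only looks at
the low 32 bits (the real hash_any() mix function is different; this
only illustrates how colliding keys are generated):

```python
def hash_low32(value: int) -> int:
    # Toy stand-in for a hash that discards the high 32 bits of an
    # int8, then scrambles the remainder a little.
    v = value & 0xFFFFFFFF
    return (v * 2654435761) & 0xFFFFFFFF  # Knuth multiplicative step

def colliding_pair(n: int):
    # n and n + 2^32 differ only in the high 32 bits, so a hash that
    # ignores those bits puts them in the same bucket.
    return n, n + (1 << 32)

a, b = colliding_pair(12345)
assert a != b
assert hash_low32(a) == hash_low32(b)
```

This is why Alex's generator adds 4294967296 (2^32) to produce
duplicate hash values without duplicate keys.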
Re: [PATCHES] hash index improving v3
Alex Hunsaker napsal(a):
> On Wed, Sep 10, 2008 at 10:27 AM, Alex Hunsaker wrote:
> > On Wed, Sep 10, 2008 at 8:47 AM, Zdenek Kotala wrote:
> > > What locale did you use? It would be nice to also have a
> > > comparison between C and any UTF8 locale. I think we should see a
> > > big benefit when a non-C locale is used.
> > Err yes this was UTF8, I'll see about doing a C locale.
>
> And here it is with a C locale:
>
> pgbench -c1 -n -t10 -f bench.sql
> cvs head: tps = 5142.784262
> v5:       tps = 6169.405965

If I look at both results:

              C      UTF8   difference
--------------------------------------
cvs head:     5140   5050   -2%
v5:           6170   5750   -7%
improvement:  20%    14%

then I wonder a little bit. I personally expected a bigger difference
for UTF8 comparing between CVS and v5. This test also shows that
locale selection has a bigger impact on performance in the v5 case,
but the result is still better than cvs head.

Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
Re: [PATCHES] hash index improving v3
On Thu, Sep 11, 2008 at 08:51:53PM -0600, Alex Hunsaker wrote:
> On Thu, Sep 11, 2008 at 9:24 AM, Kenneth Marshall wrote:
> > Alex, I meant to check the performance with increasing numbers of
> > collisions, not increasing size of the hashed item.
> > [snip]
>
> *doh* right something like this...
>
> [create_test_hash() function snipped]
>
> And then benchmark each table, and for extra credit cluster the table
> on the index and benchmark that. Also obviously with the hashint8
> which just ignores the top 32 bits. Right?

Yes, that is exactly right.

Ken
Re: [PATCHES] hash index improving v3
On Fri, Sep 12, 2008 at 3:14 AM, Zdenek Kotala <[EMAIL PROTECTED]> wrote:
> If I look at both results, then I wonder a little bit. I personally
> expected a bigger difference for UTF8 comparing between CVS and v5.
> This test also shows that locale selection has a bigger impact on
> performance in the v5 case, but the result is still better than cvs
> head.

Right, I think part of it is I need to try again with a larger
dataset... The reason I did three runs before was because it was
variable between runs (v5 between 5700 and 6200). Also the query I'm
doing might be too simplistic because it runs too fast; that's part of
why I get the variable speed between runs (I think...)

Suggestions?
Re: [PATCHES] hash index improving v3
On Wed, Sep 10, 2008 at 10:17:31PM -0600, Alex Hunsaker wrote:
> On Wed, Sep 10, 2008 at 9:49 PM, Alex Hunsaker wrote:
> > [earlier exchange with Kenneth Marshall snipped]
> >
> > Ill go see about doing this. Stay tuned.
>
> Assuming I understood you correctly (and I probably didn't), this
> does not work very well because you max out at 27,006 values before
> you get this error:
>
> ERROR: index row size 8152 exceeds hash maximum 8144
> HINT: Values larger than a buffer page cannot be indexed.
>
> So is a power-of-2 multiple of 500 not simply:
>
> x = 500; while (1) { print x; x *= 2; }
>
> ?

Alex, I meant to check the performance with increasing numbers of
collisions, not increasing size of the hashed item. In other words,
something like this:

for ($coll = 500; $coll <= $max_coll; $coll = $coll * 2) {
  for ($i = 0; $i <= $max_coll; $i++) {
    hash(int8 $i);
  }
  # add the appropriate number of collisions, distributed evenly to
  # minimize the packing overrun problem
  for ($dup = 0; $dup <= $coll; $dup++) {
    hash(int8 MAX_INT + $dup * $max_coll/$coll);
  }
}

Ken
Re: [PATCHES] hash index improving v3
On Thu, Sep 11, 2008 at 9:24 AM, Kenneth Marshall <[EMAIL PROTECTED]> wrote:
> Alex, I meant to check the performance with increasing numbers of
> collisions, not increasing size of the hashed item. In other words,
> something like this:
>
> for ($coll = 500; $coll <= $max_coll; $coll = $coll * 2) {
>   for ($i = 0; $i <= $max_coll; $i++) {
>     hash(int8 $i);
>   }
>   # add the appropriate number of collisions, distributed evenly to
>   # minimize the packing overrun problem
>   for ($dup = 0; $dup <= $coll; $dup++) {
>     hash(int8 MAX_INT + $dup * $max_coll/$coll);
>   }
> }
>
> Ken

*doh* right something like this...

create or replace function create_test_hash() returns bool as $$
declare
    coll integer default 500;
    -- tweak this to where create index gets really slow
    max_coll integer default 100;
begin
    loop
        execute 'create table test_hash_'|| coll ||' (num int8);';
        execute 'insert into test_hash_'|| coll ||' (num)
            select n from generate_series(0, '|| max_coll ||') as n;';
        execute 'insert into test_hash_'|| coll ||' (num)
            select (n+4294967296) * '|| max_coll ||'/'|| coll ||'::int
            from generate_series(0, '|| coll ||') as n;';
        coll := coll * 2;
        exit when coll >= max_coll;
    end loop;
    return true;
end;
$$ language 'plpgsql';

And then benchmark each table, and for extra credit cluster the table
on the index and benchmark that. Also obviously with the hashint8
which just ignores the top 32 bits. Right?
Re: [PATCHES] hash index improving v3
Alex Hunsaker napsal(a):
> wide:
> # NOTE not on the same machine as the narrow test was run!
> # spit out 2,000,000 random 100 length strings
> perl gen.pl > data.sql
>
> create table test_hash (wide text);
> copy test_hash from './data.sql';
> create index test_hash_num_idx on test_hash using hash (wide);
>
> bench.sql:
> select a.wide from test_hash as a
>   inner join test_hash as b on b.wide = a.wide
>   where a.wide = 'BJNORSLMITGKHJCWDBLKLYRSJTVPTYXZJPWNBKXGHYFNDHRAKNFMDHRMUXLDXNTRBJMTHPGPBFJZPAENZXDHAHCUSCJTUPUXWCXUH';
> # ^ that string is in data.sql
>
> # 3 runs each
> pgbench -c1 -n -t10 -f bench.sql
> cvs head: tps = 5073.463498, 5110.620923, 4955.347610
> v5:       tps = 5870.681336, 5740.007837, 5699.002942

What locale did you use? It would be nice to also have a comparison
between C and any UTF8 locale. I think we should see a big benefit
when a non-C locale is used.

Thanks

Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
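The gen.pl script quoted above is not included in the thread. A
hypothetical equivalent of what it appears to do (emit N random
100-character strings, one per line, for COPY into the test table) —
the function name, seed, and alphabet are assumptions, not Alex's
actual script:

```python
import random
import string

def random_rows(n, width=100, seed=0):
    # Generate n fixed-width random uppercase strings, one table row
    # each, matching the style of the sample value in bench.sql.
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_uppercase)
                    for _ in range(width))
            for _ in range(n)]

if __name__ == "__main__":
    for row in random_rows(2_000_000):
        print(row)
```

Piping this to a file and loading it with COPY reproduces the shape of
the wide-column test data.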
Re: [PATCHES] hash index improving v3
On Wed, Sep 10, 2008 at 8:47 AM, Zdenek Kotala <[EMAIL PROTECTED]> wrote:
> What locale did you use? It would be nice to also have a comparison
> between C and any UTF8 locale. I think we should see a big benefit
> when a non-C locale is used.

Err yes this was UTF8, I'll see about doing a C locale.
Re: [PATCHES] hash index improving v3
On Wed, Sep 10, 2008 at 10:27 AM, Alex Hunsaker <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 10, 2008 at 8:47 AM, Zdenek Kotala wrote:
> > What locale did you use? It would be nice to also have a comparison
> > between C and any UTF8 locale. I think we should see a big benefit
> > when a non-C locale is used.
> Err yes this was UTF8, I'll see about doing a C locale.

And here it is with a C locale:

pgbench -c1 -n -t10 -f bench.sql
cvs head: tps = 5142.784262
v5:       tps = 6169.405965

and just for fun, the same using a btree index:

pgbench -c1 -n -t10 -f bench.sql
cvs head: tps = 5234.680407
v5:       tps = 5286.08252
Re: [PATCHES] hash index improving v3
On Wed, Sep 10, 2008 at 7:04 AM, Kenneth Marshall <[EMAIL PROTECTED]> wrote:
> On Tue, Sep 09, 2008 at 07:23:03PM -0600, Alex Hunsaker wrote:
> > On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall wrote:
> > > I think that the glacial speed for generating a big hash index is
> > > the same problem that the original code faced.
> >
> > Yeah sorry, I was not saying it was a new problem with the patch.
> > Err at least not trying to :) *Both* of them had been running at
> > 18+ hours (I finally killed them sometime Sunday or around +32
> > hours...)
> >
> > > It would be useful to have an equivalent test for the hash-only
> > > index without the modified int8 hash function, since that would
> > > be more representative of its performance. The collision rates
> > > that I was observing in my tests of the old and new mix()
> > > functions was about 2 * (1/1) of what your test generated. You
> > > could just test against the integers between 1 and 200.
> >
> > Sure but then it's pretty much just a general test of patch vs no
> > patch. i.e. How do we measure how much longer collisions take when
> > the new patch makes things faster? That's what I was trying to
> > measure... Though I apologize I don't think that was clearly stated
> > anywhere...
>
> Right, I agree that we need to benchmark the collision processing
> time difference. I am not certain that two data points is useful
> information. There are 469 collisions with our current hash function
> on the integers from 1 to 200. What about testing the performance at
> power-of-2 multiples of 500, i.e. 500, 1000, 2000, 4000, 8000,...
> Unless you adjust the fill calculation for the CREATE INDEX, I would
> stop once the time to create the index spikes. It might also be
> useful to see if a CLUSTER affects the performance as well. What do
> you think of that strategy?

Not sure it will be a good benchmark of collision processing. Then
again you seem to have studied the hash algo closer than me.

Ill go see about doing this. Stay tuned.
Re: [PATCHES] hash index improving v3
On Wed, Sep 10, 2008 at 9:49 PM, Alex Hunsaker [EMAIL PROTECTED] wrote: On Wed, Sep 10, 2008 at 7:04 AM, Kenneth Marshall [EMAIL PROTECTED] wrote: On Tue, Sep 09, 2008 at 07:23:03PM -0600, Alex Hunsaker wrote: On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall [EMAIL PROTECTED] wrote: I think that the glacial speed for generating a big hash index is the same problem that the original code faced. Yeah sorry, I was not saying it was a new problem with the patch. Err at least not trying to :) *Both* of them had been running at 18+ (I finally killed them sometime Sunday or around +32 hours...) It would be useful to have an equivalent test for the hash-only index without the modified int8 hash function, since that would be more representative of its performance. The collision rates that I was observing in my tests of the old and new mix() functions was about 2 * (1/1) of what you test generated. You could just test against the integers between 1 and 200. Sure but then its pretty much just a general test of patch vs no patch. i.e. How do we measure how much longer collisions take when the new patch makes things faster? That's what I was trying to measure... Though I apologize I don't think that was clearly stated anywhere... Right, I agree that we need to benchmark the collision processing time difference. I am not certain that two data points is useful information. There are 469 collisions with our current hash function on the integers from 1 to 200. What about testing the performance at power-of-2 multiples of 500, i.e. 500, 1000, 2000, 4000, 8000,... Unless you adjust the fill calculation for the CREATE INDEX, I would stop once the time to create the index spikes. It might also be useful to see if a CLUSTER affects the performance as well. What do you think of that strategy? Not sure it will be a good benchmark of collision processing. Then again you seem to have studied the hash algo closer than me. Ill go see about doing this. Stay tuned. 
Assuming I understood you correctly (and I probably didn't), this does not work very well, because you max out at 27,006 values before you get this error:

ERROR: index row size 8152 exceeds hash maximum 8144
HINT: Values larger than a buffer page cannot be indexed.

So is a power-of-2 multiple of 500 not simply:

x = 500;
while(1) {
    print x;
    x *= 2;
}

?

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Sat, Sep 06, 2008 at 08:23:05PM -0600, Alex Hunsaker wrote:
> On Sat, Sep 6, 2008 at 1:09 PM, Tom Lane [EMAIL PROTECTED] wrote:
>> For the convenience of anyone intending to test, here is an updated patch against CVS HEAD that incorporates Alex's fix.
>
> Here are the results for a table containing 1 million entries that will generate hash collisions. It paints a bad picture for the patch, but then again I'm not sure how relevant the issue is. For example, yesterday I imported a table with 10 million collisions and the create index is still running (now at about ~18 hours). Maybe we should warn if there are lots of collisions when creating the index and suggest you use a btree? Anyway, here are the results.

I think that the glacial speed for generating a big hash index is the same problem that the original code faced. Because of the collisions you are unable to achieve the correct packing that the code assumes. This results in the splitting/copying of every page in the hash index, a very slow proposition. I had suggested adding some additional parameters like fillfactor to accommodate these sorts of situations. Since your test cuts the effective fill by 2 because of the many collisions, you would need to adjust that calculation to avoid the tremendous amount of random I/O generated by that mis-assumption.

> ./pgbench -c1 -n -t10 -f bench_createindex.sql
> cvs head: tps = 0.002169
> v5:       tps = 0.002196
>
> pgbench -c1 -n -t1000 -f bench_bitmap.sql
> cvs head: tps = 24.011871
> v5:       tps = 2.543123
>
> pgbench -c1 -n -t1000 -f bench_index.sql
> cvs head: tps = 51.614502
> v5:       tps = 3.205542
>
> pgbench -c1 -n -t1000 -f bench_seqscan.sql
> cvs head: tps = 8.553318
> v5:       tps = 9.836091
>
> Table created via:
> create table test_hash (num int8);
> ./hash | psql -c 'copy test_hash from stdin;'

It would be useful to have an equivalent test for the hash-only index without the modified int8 hash function, since that would be more representative of its performance. The collision rates that I was observing in my tests of the old and new mix() functions were about 2 * (1/1) of what your test generated. You could just test against the integers between 1 and 2000000.

Ken

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall [EMAIL PROTECTED] wrote:
> I think that the glacial speed for generating a big hash index is the same problem that the original code faced.

Yeah, sorry, I was not saying it was a new problem with the patch. Err, at least not trying to :) *Both* of them had been running at 18+ hours (I finally killed them sometime Sunday, or around +32 hours...)

> It would be useful to have an equivalent test for the hash-only index without the modified int8 hash function, since that would be more representative of its performance. The collision rates that I was observing in my tests of the old and new mix() functions were about 2 * (1/1) of what your test generated. You could just test against the integers between 1 and 2000000.

Sure, but then it's pretty much just a general test of patch vs. no patch, i.e. how do we measure how much longer collisions take when the new patch makes things faster? That's what I was trying to measure... Though I apologize, I don't think that was clearly stated anywhere... Now of course it still would be interesting... And if it's only to 2,000,000 I can still use the modified int8 or just use the int4 one... Anyway, here are the numbers:

create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash (num);

pgbench -c1 -n -t1 -f bench_index.sql
cvs head: tps = 3161.500511
v5:       tps = 7248.808839

BTW I'm still planning on doing a wide vs narrow test... sometime... :)

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Tue, Sep 9, 2008 at 7:23 PM, Alex Hunsaker [EMAIL PROTECTED] wrote:
> create table test_hash(num int8);
> insert into test_hash (num) select generate_series(1, 2000000);
> create index test_hash_num_idx on test_hash (num);
>
> pgbench -c1 -n -t1 -f bench_index.sql
> cvs head: tps = 3161.500511
> v5:       tps = 7248.808839

Whoa... Ignore those numbers! It would be nice to actually test with a hash index... and my cvs head atm had cassert on... sorry about that.

create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash using hash (num);

pgbench -c1 -n -t10 -f bench_index.sql
cvs head: tps = 7345.69432
v5:       tps = 7526.290462

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Tue, Sep 9, 2008 at 7:23 PM, Alex Hunsaker [EMAIL PROTECTED] wrote:
> BTW I'm still planning on doing a wide vs narrow test... sometime... :)

narrow: (exactly the same as what I just did in the other post)

create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash using hash (num);

pgbench -c1 -n -t10 -f bench_index.sql
cvs head: tps = 7345.69432
v5:       tps = 7526.290462

wide:
# NOTE not on the same machine as the narrow test was run!
# spit out 2,000,000 random 100-length strings
perl gen.pl > data.sql

create table test_hash (wide text);
copy test_hash from './data.sql';
create index test_hash_num_idx on test_hash using hash (wide);

bench.sql:
select a.wide from test_hash as a inner join test_hash as b on b.wide = a.wide where a.wide = 'BJNORSLMITGKHJCWDBLKLYRSJTVPTYXZJPWNBKXGHYFNDHRAKNFMDHRMUXLDXNTRBJMTHPGPBFJZPAENZXDHAHCUSCJTUPUXWCXUH';
# ^ that string is in data.sql

# 3 runs each
pgbench -c1 -n -t10 -f bench.sql
cvs head: tps = 5073.463498, 5110.620923, 4955.347610
v5:       tps = 5870.681336, 5740.007837, 5699.002942

gen.pl Description: Perl program

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
It is a nice test case. It should be part of the hash index regression tests.

Zdenek

Alex Hunsaker napsal(a):
> Ok, now that I made it so it actually *tests* collisions, with the patch it always returns all rows that matched the hashed key. For example (non-lobotomized inthash8; I just brute-forced some collisions for 0 so that anyone can try this):
>
> create table test_hash (num int8);
> insert into test_hash (num) values (0), (1805671691), (3294821164), (4294967297);
> create index test_hash_num_idx on test_hash using hash (num);
>
> select * from test_hash where num = 0;
>     num
> ------------
>           0
>  1805671691
>  3294821164
>  4294967297
>
> set enable_bitmapscan to off;
> select * from test_hash where num = 0;
>  num
> -----
>    0
>
> CVS HEAD, 8_3_STABLE obviously work if I force the index scan

--
Zdenek Kotala
Sun Microsystems Prague, Czech Republic
http://sun.com/postgresql

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Tom Lane napsal(a):
> Alex Hunsaker [EMAIL PROTECTED] writes:
>> On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker [EMAIL PROTECTED] wrote:
>>> Ok, now that I made it so it actually *tests* collisions, with the patch it always returns all rows that matched the hashed key.
>>
>> And here is the fix; we just forgot to set the recheck flag for bitmap scans.
>
> For the convenience of anyone intending to test, here is an updated patch against CVS HEAD that incorporates Alex's fix.

Tom, I attach a cleanup patch which we discussed during the last commit fest. It introduces the new macros HashGetMetaPage and HashGetBitmap and, of course, it breaks on-disk format compatibility, which we will break anyway when Xiao's patch is committed.

Zdenek

--
Zdenek Kotala
Sun Microsystems Prague, Czech Republic
http://sun.com/postgresql

Common subdirectories: pgsql_hash.be486df40421/src and pgsql_hash/src
Common subdirectories: pgsql_hash.be486df40421/src/backend and pgsql_hash/src/backend
Common subdirectories: pgsql_hash.be486df40421/src/include and pgsql_hash/src/include
Common subdirectories: pgsql_hash.be486df40421/src/backend/access and pgsql_hash/src/backend/access
Common subdirectories: pgsql_hash.be486df40421/src/backend/access/hash and pgsql_hash/src/backend/access/hash
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hash.c pgsql_hash/src/backend/access/hash/hash.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hash.c	Mon Sep  8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hash.c	Mon Sep  8 16:14:00 2008
***************
*** 527,533 ****
  	 * each bucket.
  	 */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = (HashMetaPage) BufferGetPage(metabuf);
  	orig_maxbucket = metap->hashm_maxbucket;
  	orig_ntuples = metap->hashm_ntuples;
  	memcpy(&local_metapage, metap, sizeof(local_metapage));
--- 527,533 ----
  	 * each bucket.
  	 */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = HashPageGetMeta(BufferGetPage(metabuf));
  	orig_maxbucket = metap->hashm_maxbucket;
  	orig_ntuples = metap->hashm_ntuples;
  	memcpy(&local_metapage, metap, sizeof(local_metapage));
***************
*** 629,635 ****
  	/* Write-lock metapage and check for split since we started */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
! 	metap = (HashMetaPage) BufferGetPage(metabuf);
  
  	if (cur_maxbucket != metap->hashm_maxbucket)
  	{
--- 629,635 ----
  	/* Write-lock metapage and check for split since we started */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
! 	metap = HashPageGetMeta(BufferGetPage(metabuf));
  
  	if (cur_maxbucket != metap->hashm_maxbucket)
  	{
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashinsert.c pgsql_hash/src/backend/access/hash/hashinsert.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashinsert.c	Mon Sep  8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashinsert.c	Mon Sep  8 16:14:00 2008
***************
*** 69,75 ****
  	/* Read the metapage */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = (HashMetaPage) BufferGetPage(metabuf);
  
  	/*
  	 * Check whether the item can fit on a hash page at all. (Eventually, we
--- 69,75 ----
  	/* Read the metapage */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = HashPageGetMeta(BufferGetPage(metabuf));
  
  	/*
  	 * Check whether the item can fit on a hash page at all. (Eventually, we
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashovfl.c pgsql_hash/src/backend/access/hash/hashovfl.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashovfl.c	Mon Sep  8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashovfl.c	Mon Sep  8 16:14:00 2008
***************
*** 187,193 ****
  	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
  	_hash_checkpage(rel, metabuf, LH_META_PAGE);
! 	metap = (HashMetaPage) BufferGetPage(metabuf);
  
  	/* start search at hashm_firstfree */
  	orig_firstfree = metap->hashm_firstfree;
--- 187,193 ----
  	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
  	_hash_checkpage(rel, metabuf, LH_META_PAGE);
! 	metap = HashPageGetMeta(BufferGetPage(metabuf));
  
  	/* start search at hashm_firstfree */
  	orig_firstfree = metap->hashm_firstfree;
***************
*** 450,456 ****
  	/* Read the metapage so we can determine which bitmap page to use */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = (HashMetaPage) BufferGetPage(metabuf);
  
  	/* Identify which bit to set */
  	ovflbitno = blkno_to_bitno(metap, ovflblkno);
--- 450,456 ----
  	/* Read the metapage so we can determine which bitmap page to use */
  	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! 	metap = HashPageGetMeta(BufferGetPage(metabuf));
  
  	/* Identify which bit to set */
  	ovflbitno = blkno_to_bitno(metap, ovflblkno);
Re: [PATCHES] hash index improving v3
Zdenek Kotala [EMAIL PROTECTED] writes: I attach cleanup patch which we discussed last commit fest. It introduce new macros HashGetMetaPage and HashGetBitmap and of course, it break backward on disk format compatibility which we will break it anyway when Xiao's patch will be committed. If we accept that patch, I'm okay with including this one too. Still hoping for some performance testing though. (Note that this one isn't going to affect performance, so no need to include it for that purpose.) regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes:
> - tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, false);
> + tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, true);

Hah, I bet that explains Jonah's complaint that recheck didn't seem to be happening in his tests. Nice catch.

regards, tom lane

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes:
> On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker [EMAIL PROTECTED] wrote:
>> Ok, now that I made it so it actually *tests* collisions, with the patch it always returns all rows that matched the hashed key.
>
> And here is the fix; we just forgot to set the recheck flag for bitmap scans.

For the convenience of anyone intending to test, here is an updated patch against CVS HEAD that incorporates Alex's fix.

regards, tom lane

binwQERo3PM5e.bin Description: hash-v5.patch.gz

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Sat, Sep 6, 2008 at 1:09 PM, Tom Lane [EMAIL PROTECTED] wrote:
> For the convenience of anyone intending to test, here is an updated patch against CVS HEAD that incorporates Alex's fix.

Here are the results for a table containing 1 million entries that will generate hash collisions. It paints a bad picture for the patch, but then again I'm not sure how relevant the issue is. For example, yesterday I imported a table with 10 million collisions and the create index is still running (now at about ~18 hours). Maybe we should warn if there are lots of collisions when creating the index and suggest you use a btree? Anyway, here are the results.

./pgbench -c1 -n -t10 -f bench_createindex.sql
cvs head: tps = 0.002169
v5:       tps = 0.002196

pgbench -c1 -n -t1000 -f bench_bitmap.sql
cvs head: tps = 24.011871
v5:       tps = 2.543123

pgbench -c1 -n -t1000 -f bench_index.sql
cvs head: tps = 51.614502
v5:       tps = 3.205542

pgbench -c1 -n -t1000 -f bench_seqscan.sql
cvs head: tps = 8.553318
v5:       tps = 9.836091

Table created via:
create table test_hash (num int8);
./hash | psql -c 'copy test_hash from stdin;'

bench_create.sql Description: Binary data
bench_index.sql Description: Binary data
bench_seqscan.sql Description: Binary data
int8collide.patch Description: Binary data

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

int main(void)
{
    unsigned long y = 0;
    unsigned cnt = 0;

    while (cnt < 1000000)
    {
        y += UINT_MAX;
        y += 1;
        printf("%ld\n", y);
        cnt++;
    }
}

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
2008/9/6 Alex Hunsaker [EMAIL PROTECTED]: pgbench -c1 -n -t1000 -f bench_bitmap.sql cvs head: tps = 24.011871 v5: tps = 2.543123 oops forgot to attach bench_bitmap.sql bench_bitmap.sql Description: Binary data -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Ok, now that I made it so it actually *tests* collisions, with the patch it always returns all rows that matched the hashed key. For example (non-lobotomized inthash8; I just brute-forced some collisions for 0 so that anyone can try this):

create table test_hash (num int8);
insert into test_hash (num) values (0), (1805671691), (3294821164), (4294967297);
create index test_hash_num_idx on test_hash using hash (num);

select * from test_hash where num = 0;
    num
------------
          0
 1805671691
 3294821164
 4294967297

set enable_bitmapscan to off;
select * from test_hash where num = 0;
 num
-----
   0

CVS HEAD, 8_3_STABLE obviously work if I force the index scan

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker [EMAIL PROTECTED] wrote:
> Ok, now that I made it so it actually *tests* collisions, with the patch it always returns all rows that matched the hashed key.

And here is the fix; we just forgot to set the recheck flag for bitmap scans.

*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 317,323 **** hashgetbitmap(PG_FUNCTION_ARGS)
  		/* Save tuple ID, and continue scanning */
  		if (add_tuple)
  		{
! 			tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, false);
  			ntids++;
  		}
--- 317,323 ----
  		/* Save tuple ID, and continue scanning */
  		if (add_tuple)
  		{
! 			tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, true);
  			ntids++;
  		}

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
I wrote: You have the unique-versus-not dimension, On second thought, actually not. What we want to look at is the penalty for false matches due to *distinct* key values that happen to have the same hash codes. Your test case for all-the-same is using all the same key values, which means it'll hit the heap a lot, but none of those will be wasted trips. So what we need for testing is a few different key values that hash to the same code. Not sure about an easy way to find such. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote:
> Alex Hunsaker [EMAIL PROTECTED] writes:
>> Ok, let me know if this is too naive of an approach or not hitting the right cases you want tested.
>
> You have the unique-versus-not dimension, but I'm also wondering about narrow vs wide index keys (say about 8 bytes vs 50-100 or so). In the former case we're not saving any index space by storing only the hash code, so these could be expected to have different performance behaviors.

Arg, yes... I just read the last part of your mail in this thread. I think it was the one on -hackers that talked about narrow vs wide... so I figured I would just try to do what the thread where you posted the patch talked about, namely the below:

> So my thinking right now is that we should just test this patch as-is. If it doesn't show really horrid performance when there are lots of hash key collisions, we should forget the store-both-things idea and just go with this.

So I thought, let's try to generate lots of hash collisions... obviously though using the same key won't do that... Not sure what I was thinking.

As for specifics of the suggested scripts:

> * might be better to do select count(*) not select 1, so that client communication is minimized

Yar.

> * check that the queries actually use the indexes (not sure that the proposed switch settings ensure this, not to mention you didn't create the indexes)

Well, I was assuming I could just test the speed of a hash join...

> * make sure the pgbench transaction counts are large enough to ensure significant runtime
> * the specific table sizes suggested are surely not large enough

Ok

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote: So what we need for testing is a few different key values that hash to the same code. Not sure about an easy way to find such. Hrm, well I have not really looked at the hash algorithm but I assume we could just reduce the number of buckets? -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes: On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote: So what we need for testing is a few different key values that hash to the same code. Not sure about an easy way to find such. Hrm, well I have not really looked at the hash algorithm but I assume we could just reduce the number of buckets? No, we need fully equal hash keys, else the code won't visit the heap. I guess one thing we could do for testing purposes is lobotomize one of the datatype-specific hash functions. For instance, make int8_hash return the input mod 2^32, ignoring the upper bytes. Then it'd be easy to compute different int8s that hash to the same thing. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes: On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote: * check that the queries actually use the indexes (not sure that the proposed switch settings ensure this, not to mention you didn't create the indexes) Well I was assuming I could just test the speed of a hash join... Uh, no, hash joins have nearly zip to do with hash indexes. They rely on the same per-datatype support functions but that's the end of the commonality. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 8:17 PM, Tom Lane [EMAIL PROTECTED] wrote:
> I guess one thing we could do for testing purposes is lobotomize one of the datatype-specific hash functions. For instance, make int8_hash return the input mod 2^32, ignoring the upper bytes. Then it'd be easy to compute different int8s that hash to the same thing.

Heh. Ok, I'm slowly getting there... So we lobotomize hashint8 and then time how long it takes to make an index on a table... something like:

create table test_hash(num int8);

(obviously on a 64-bit machine)

int main(void)
{
    unsigned long y = 0;
    unsigned cnt = 0;

    printf("insert into test_hash (num) values ");

    /* while(cnt != LONG_MAX/UINT_MAX) */
    while (cnt < 1000)
    {
        y += UINT_MAX;
        printf("(%ld), ", y);
        cnt++;
    }
    printf("(0);\n");
}

./a.out | psql
pgbench -c 1 -t1000 -n -f test.sql

test.sql:
create index test_hash_num_idx on test_hash using hash (num);
drop index test_hash_num_idx;

For both pre and post patch, just to make sure post patch is not worse than pre patch??? If I'm still way off and it's not too much trouble, want to give me a test case to run =) ? Or maybe because hash collisions should be fairly rare it's not something to really worry about?

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Mon, 2008-08-18 at 09:46 +0800, Xiao Meng wrote: There's minor change against the previous one( http://archives.postgresql.org/pgsql-hackers/2008-07/msg01183.php ). * merge branch master(Aug 16) into the patch * clean code and make some comment Performance result is here http://wiki.postgresql.org/wiki/Gsoc08-hashindex It seems hash index is a little better on index creation and selection. But maybe it's in the range of noise, I'm not sure. I'd like to try it with a bigger dataset (e.g. table with 10GB) but there is not enough space in my computer. Anyone interest can make a test on a bigger data set. You don't give the text of the query used to do these performance tests, so I can't validate your test results. Right now it seems strange that the index is larger than a btree, yet the performance tests show that 3 times as much I/O was used accessing the btree. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
With the help of David Fetter, you can get the copy by

git clone http://git.postgresql.org/git/~davidfetter/hash/.git

It's in the branch gsoc-hash. Thank you, David.

--
Best Regards,
Xiao Meng

DKERC, Harbin Institute of Technology, China
Gtalk: [EMAIL PROTECTED]
MSN: [EMAIL PROTECTED]
http://xiaomeng.yo2.cn