Re: [PATCHES] hash index improving v3
I wrote:
> You have the unique-versus-not dimension,

On second thought, actually not.  What we want to look at is the penalty
for false matches due to *distinct* key values that happen to have the
same hash codes.  Your test case for all-the-same is using all the same
key values, which means it'll hit the heap a lot, but none of those will
be wasted trips.

So what we need for testing is a few different key values that hash to
the same code.  Not sure about an easy way to find such.

			regards, tom lane

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote:
> Alex Hunsaker [EMAIL PROTECTED] writes:
>> Ok let me know if this is too naive of an approach or not hitting the
>> right cases you want tested.
>
> You have the unique-versus-not dimension, but I'm also wondering about
> narrow vs wide index keys (say about 8 bytes vs 50-100 or so).  In the
> former case we're not saving any index space by storing only the hash
> code, so these could be expected to have different performance
> behaviors.

Arg, yes... I just read the last part of your mail in this thread.  I
think it was the one on -hackers that talked about narrow vs wide... so
I figured I would just try to do what the thread where you posted the
patch talked about, namely the below:

> So my thinking right now is that we should just test this patch as-is.
> If it doesn't show really horrid performance when there are lots of
> hash key collisions, we should forget the store-both-things idea and
> just go with this.

So I thought, let's try to generate lots of hash collisions...
obviously though using the same key won't do that...  Not sure what I
was thinking.

> As for specifics of the suggested scripts:
> * might be better to do select count(*) not select 1, so that client
>   communication is minimized

Yar.

> * check that the queries actually use the indexes (not sure that the
>   proposed switch settings ensure this, not to mention you didn't
>   create the indexes)

Well I was assuming I could just test the speed of a hash join...

> * make sure the pgbench transaction counts are large enough to ensure
>   significant runtime
> * the specific table sizes suggested are surely not large enough

Ok
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote:
> So what we need for testing is a few different key values that hash to
> the same code.  Not sure about an easy way to find such.

Hrm, well I have not really looked at the hash algorithm, but I assume
we could just reduce the number of buckets?
Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Heikki Linnakangas [EMAIL PROTECTED] writes:
> After reading the wikipedia article on Boyer-Moore search algorithm, it
> looks to me like this patch actually implements the simpler
> Boyer-Moore-Horspool algorithm that only uses one lookup table.  That's
> probably fine, as it ought to be faster on small needles and haystacks
> because it requires less effort to build the lookup tables, even though
> the worst-case performance is worse.  It should still be faster than
> what we have now.

Hmm.  B-M-H has worst-case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N).  Maybe
we should build the second table when M is large?  Although
realistically that is probably gilding the lily, since frankly there
haven't been many real-world complaints about the speed of these
functions anyway ...

> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState.  That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and
> text_to_array.

+1.  The heuristic about how big a skip table to make may need some
adjustment as well, since it seems to be considering only a single
search.

			regards, tom lane
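[For reference, a minimal standalone sketch of the Boyer-Moore-Horspool variant under discussion — one skip table indexed by byte value, nothing else; this is a generic illustration, not the patch's actual code or PostgreSQL's text_position machinery.]

```c
#include <stddef.h>
#include <string.h>

/*
 * Boyer-Moore-Horspool search: a single 256-entry skip table records,
 * for each byte value, how far the search window may shift when that
 * byte sits under the last position of the needle.  Worst case is
 * O(M*N), but typical shifts skip most of the haystack.  Returns a
 * pointer to the first match, or NULL if there is none.
 */
static const char *
bmh_search(const char *haystack, size_t hlen,
		   const char *needle, size_t nlen)
{
	size_t		skip[256];
	size_t		i;

	if (nlen == 0)
		return haystack;		/* empty needle matches at the start */
	if (nlen > hlen)
		return NULL;

	/* default shift: the whole needle length */
	for (i = 0; i < 256; i++)
		skip[i] = nlen;

	/* bytes appearing in the needle (except its last byte) shift less */
	for (i = 0; i < nlen - 1; i++)
		skip[(unsigned char) needle[i]] = nlen - 1 - i;

	i = 0;
	while (i <= hlen - nlen)
	{
		if (memcmp(haystack + i, needle, nlen) == 0)
			return haystack + i;
		/* shift by the skip value of the byte under the needle's end */
		i += skip[(unsigned char) haystack[i + nlen - 1]];
	}
	return NULL;
}
```

Building the skip table is the O(M + 256) setup cost Heikki refers to: doing it once in text_position_start and reusing it across repeated text_position_next calls is exactly what saves replace_text and friends from rebuilding it per search.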
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes:
> On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote:
>> So what we need for testing is a few different key values that hash to
>> the same code.  Not sure about an easy way to find such.
>
> Hrm, well I have not really looked at the hash algorithm but I assume
> we could just reduce the number of buckets?

No, we need fully equal hash keys, else the code won't visit the heap.
I guess one thing we could do for testing purposes is lobotomize one of
the datatype-specific hash functions.  For instance, make int8_hash
return the input mod 2^32, ignoring the upper bytes.  Then it'd be easy
to compute different int8s that hash to the same thing.

			regards, tom lane
Re: [PATCHES] hash index improving v3
Alex Hunsaker [EMAIL PROTECTED] writes:
> On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote:
>> * check that the queries actually use the indexes (not sure that the
>>   proposed switch settings ensure this, not to mention you didn't
>>   create the indexes)
>
> Well I was assuming I could just test the speed of a hash join...

Uh, no, hash joins have nearly zip to do with hash indexes.  They rely
on the same per-datatype support functions but that's the end of the
commonality.

			regards, tom lane
Re: [PATCHES] hash index improving v3
On Thu, Sep 4, 2008 at 8:17 PM, Tom Lane [EMAIL PROTECTED] wrote:
> I guess one thing we could do for testing purposes is lobotomize one of
> the datatype-specific hash functions.  For instance, make int8_hash
> return the input mod 2^32, ignoring the upper bytes.  Then it'd be easy
> to compute different int8s that hash to the same thing.

Heh, OK, I'm slowly getting there...  So we lobotomize hashint8 and then
time how long it takes to make an index on a table... something like:

create table test_hash(num int8);

(obviously on a 64-bit machine)

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned long y = 0;
	unsigned cnt = 0;

	printf("insert into test_hash (num) values ");

	/* while (cnt != LONG_MAX / UINT_MAX) */
	while (cnt < 1000)
	{
		/* step by 2^32 so every value is congruent mod 2^32 */
		y += (unsigned long) UINT_MAX + 1;
		printf("(%lu), ", y);
		cnt++;
	}
	printf("(0);\n");
	return 0;
}

./a.out | psql

pgbench -c 1 -t 1000 -n -f test.sql

test.sql:
create index test_hash_num_idx on test_hash using hash (num);
drop index test_hash_num_idx;

For both pre- and post-patch, just to make sure post-patch is not worse
than pre-patch???  If I'm still way off and it's not too much trouble,
want to give me a test case to run =) ?  Or maybe because hash
collisions should be fairly rare it's not something to really worry
about?