Re: [PATCHES] hash index improving v3

2008-09-04 Thread Tom Lane
I wrote:
 You have the unique-versus-not dimension,

On second thought, actually not.  What we want to look at is the penalty
for false matches due to *distinct* key values that happen to have the
same hash codes.  Your test case for all-the-same is using all the same
key values, which means it'll hit the heap a lot, but none of those will
be wasted trips.

So what we need for testing is a few different key values that hash to
the same code.  Not sure about an easy way to find such.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-04 Thread Alex Hunsaker
On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote:
 Alex Hunsaker [EMAIL PROTECTED] writes:
 Ok let me know if this is to naive of an approach or not hitting the
 right cases you want tested.

 You have the unique-versus-not dimension, but I'm also wondering about
 narrow vs wide index keys (say about 8 bytes vs 50-100 or so).  In the
 former case we're not saving any index space by storing only the hash
 code, so these could be expected to have different performance
 behaviors.

Arg yes... I just read the last part of your mail in this thread.  I
think it was the one on -hackers that talked about narrow vs wide...
so I figured I would just try to do what the thread where you posted
the patch talked about namley the below:

So my thinking right now is that we should just test this patch as-is.
If it doesn't show really horrid performance when there are lots of
hash key collisions, we should forget the store-both-things idea and
just go with this.

So I thought, lets try to generate lots of hash collisions... obviosly
though using the same key wont do that... Not sure what I was thinking

 As for specifics of the suggested scripts:

 * might be better to do select count(*) not select 1, so that client
 communication is minimized

Yar.

 * check that the queries actually use the indexes (not sure that the
 proposed switch settings ensure this, not to mention you didn't create
 the indexes)

Well I was assuming I could just test the speed of a hash join...

 * make sure the pgbench transaction counts are large enough to ensure
 significant runtime
 * the specific table sizes suggested are surely not large enough

Ok

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-04 Thread Alex Hunsaker
On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote:
 So what we need for testing is a few different key values that hash to
 the same code.  Not sure about an easy way to find such.

Hrm, well I have not really looked at the hash algorithm but I assume
we could just reduce the number of buckets?

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

2008-09-04 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 After reading the wikipedia article on Boyer-Moore search algorithm, it 
 looks to me like this patch actually implements the simpler 
 Boyer-Moore-Horspool algorithm that only uses one lookup table. That's 
 probably fine, as it ought to be faster on small needles and haystacks 
 because it requires less effort to build the lookup tables, even though 
 the worst-case performance is worse. It should still be faster than what 
 we have now.

Hmm.  B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N).
Maybe we should build the second table when M is large?  Although
realistically that is probably gilding the lily, since frankly there
haven't been many real-world complaints about the speed of these
functions anyway ...

 The skip table really should be constructed only once in 
 text_position_start and stored in TextPositionState. That would make a 
 big difference to the performance of those functions that call 
 text_position_next repeatedly: replace_text, split_text and text_to_array.

+1.  The heuristic about big a skip table to make may need some
adjustment as well, since it seems to be considering only a single
search.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-04 Thread Tom Lane
Alex Hunsaker [EMAIL PROTECTED] writes:
 On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane [EMAIL PROTECTED] wrote:
 So what we need for testing is a few different key values that hash to
 the same code.  Not sure about an easy way to find such.

 Hrm, well I have not really looked at the hash algorithm but I assume
 we could just reduce the number of buckets?

No, we need fully equal hash keys, else the code won't visit the heap.

I guess one thing we could do for testing purposes is lobotomize one of
the datatype-specific hash functions.  For instance, make int8_hash
return the input mod 2^32, ignoring the upper bytes.  Then it'd be easy
to compute different int8s that hash to the same thing.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-04 Thread Tom Lane
Alex Hunsaker [EMAIL PROTECTED] writes:
 On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane [EMAIL PROTECTED] wrote:
 * check that the queries actually use the indexes (not sure that the
 proposed switch settings ensure this, not to mention you didn't create
 the indexes)

 Well I was assuming I could just test the speed of a hash join...

Uh, no, hash joins have nearly zip to do with hash indexes.  They rely
on the same per-datatype support functions but that's the end of the
commonality.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-04 Thread Alex Hunsaker
On Thu, Sep 4, 2008 at 8:17 PM, Tom Lane [EMAIL PROTECTED] wrote:
 I guess one thing we could do for testing purposes is lobotomize one of
 the datatype-specific hash functions.  For instance, make int8_hash
 return the input mod 2^32, ignoring the upper bytes.  Then it'd be easy
 to compute different int8s that hash to the same thing.


Heh Ok im slowly getting there... So we lobotomize hashint8 and then
time how long it takes to make an index on a table... something like:
create table test_hash(num int8);

(obviously on a 64 bit machine)
int main(void)
{
unsigned long y = 0;
unsigned cnt = 0;

printf(insert into test_hash (num) values );

//while(cnt != LONG_MAX/UINT_MAX)
while(cnt  1000)
{
y += UINT_MAX;

printf((%ld), , y);

cnt++;
}

printf((0);\n);

}

./a.out | psql

pgbench -c 1 -t1000 -n -f test.sql

test.sql:
create index test_hash_num_idx on test_hash using hash (num);
drop index test_hash_num_idx;

For both pre and post patch just to make sure post patch is not worse
than pre patch???

If im still way off and its not to much trouble want to give me a test
case to run =) ?

Or maybe because hash collisions should be fairly rare its not
something to really worry about?

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches