Re: [PERFORM] Array indexes, GIN?

2007-03-02 Thread Jeff Davis
On Thu, 2007-03-01 at 19:59 -0800, Adam L Beberg wrote:
 On the surface, looks like a job for GIN, but GIN seems undocumented, 
 specifically mentions it doesn't support the deletes we'll have many of 
 since it's designed for word searching apparently, the performance 

It can delete an entry for one of the keys of an index, it just can't
delete the key itself when the number of entries goes down to zero.
Because you only have O(100K) possible keys, that shouldn't be a
problem. The GIN indexes can reclaim space. If they couldn't, they
wouldn't be nearly as useful. 

The time when you run into problems is when you have a huge, sparsely
populated keyspace, with a huge number of keys contained by no tuples in
the table.

However, for your application, GIN still might not be the right answer.
GIN can only return tuples which do contain some matching keys, it won't
return the number of matching keys in that tuple (that's not the job of
an index). 

Let's run some numbers:

 * percentage of tuples returned = 100K rows out of the 10M = 1%
 * tuples per page = 8192 bytes / 32 (tuple header) + 8 (bigint) + 80
(10 bigints) = ~70. Let's say it's 50 due to some free space.

Based on those numbers, the GIN index is basically going to say get
every other page. PostgreSQL will optimize that into a sequential scan
because it makes no sense to do a random fetch for every other page.

So, the fastest way you can do this (that I can see) is just fetch every
tuple and count the number of matches in each array. You know your data
better than I do, so replace those numbers with real ones and see if it
still makes sense.

The basic rule is that an index scan is useful only if it reduces the
number of disk pages you need to fetch enough to make up for the extra
cost of random I/O.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[PERFORM] Array indexes, GIN?

2007-03-01 Thread Adam L Beberg
I need to cross reference 2 tables. There are O(10M) A's, each has an 
ordered set of 10 of the O(100K) B's associated with it. The dominant 
query will be finding the A's and their count associated with a given 
list of ~1k B's i.e. if 2 of the listed B's are in A's set of 10, it's 
(A,2), and we should get back ~100K rows. The good news is we only need 
to run this brutal query every couple minutes, but the row updates will 
flow fast.


Luckily this is PostgreSQL, so the simple solution seems to be

  CREATE TABLE xref( A bigint, B bigint[10] ); -- A is primary key

which cuts down the table overhead. O(10M) rows w/array.

On the surface, looks like a job for GIN, but GIN seems undocumented, 
specifically mentions it doesn't support the deletes we'll have many of 
since it's designed for word searching apparently, the performance 
implications are undocumented. I searched, I read, and even IRC'd, and 
it seems like GIN is just not used much.


Is GIN right? Will this work at all? Will it run fast enough to function?

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Array indexes, GIN?

2007-03-01 Thread Josh Berkus
Adam,

 On the surface, looks like a job for GIN, but GIN seems undocumented,
 specifically mentions it doesn't support the deletes we'll have many of
 since it's designed for word searching apparently, the performance
 implications are undocumented. I searched, I read, and even IRC'd, and
 it seems like GIN is just not used much.

It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious for 
skimpy documetentation.

I'd start with the code in INTARRAY contrib module (also by Teodor) and bug 
them on pgsql-hackers about helping you implement a GIN index for arrays.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Array indexes, GIN?

2007-03-01 Thread Oleg Bartunov

On Thu, 1 Mar 2007, Josh Berkus wrote:


Adam,


On the surface, looks like a job for GIN, but GIN seems undocumented,
specifically mentions it doesn't support the deletes we'll have many of
since it's designed for word searching apparently, the performance
implications are undocumented. I searched, I read, and even IRC'd, and
it seems like GIN is just not used much.


It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious for
skimpy documetentation.


We're getting better, we have 72 pages written about new FTS :)



I'd start with the code in INTARRAY contrib module (also by Teodor) and bug
them on pgsql-hackers about helping you implement a GIN index for arrays.


GIN already has support for one dimensional arrays and intarray, particularly,
too has support of GiN.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Array indexes, GIN?

2007-03-01 Thread Adam L Beberg

Oleg Bartunov wrote on 3/1/2007 10:45 PM:

On Thu, 1 Mar 2007, Josh Berkus wrote:


Adam,


On the surface, looks like a job for GIN, but GIN seems undocumented,
specifically mentions it doesn't support the deletes we'll have many of
since it's designed for word searching apparently, the performance
implications are undocumented. I searched, I read, and even IRC'd, and
it seems like GIN is just not used much.


It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious 
for

skimpy documetentation.


We're getting better, we have 72 pages written about new FTS :)


I'm guessing FTS is not quite done since you matched 'FTS' to 'GIN' ;)

GIN already has support for one dimensional arrays and intarray, 
particularly, too has support of GiN.


Great, so can GIN handle my situation? I'm a little unsure what to make 
of Note: There is no delete operation for ET. in particular since I'm 
dealing with large numbers.


--
Adam L. Beberg
http://www.mithral.com/~beberg/

---(end of broadcast)---
TIP 6: explain analyze is your friend