[PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread hans wulf
Hi,

I need an ANTI-JOIN (NOT EXISTS (SELECT something FROM table ...) / LEFT JOIN table 
WHERE table.id IS NULL) on the same table. Actually I have an index that could serve 
the NOT EXISTS question, but the query planner chooses to do a bitmap heap scan.

The table has 100 million rows, so doing a heap scan is hopeless...

It would be really fast if Postgres could compare the two indices. Does 
Postgres have to visit the table for this ANTI-JOIN?

I know the table has to be visited at some point to serve MVCC, but why so 
early? Can NOT EXISTS only be resolved from the table, because it could miss 
something otherwise?
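For reference, these are the two equivalent spellings I mean (table and column names here are made up for illustration; the planner treats both as an anti-join):

```sql
-- NOT EXISTS form:
SELECT t.id
FROM big_table t
WHERE NOT EXISTS (SELECT 1 FROM big_table o WHERE o.parent_id = t.id);

-- Equivalent LEFT JOIN ... IS NULL form:
SELECT t.id
FROM big_table t
LEFT JOIN big_table o ON o.parent_id = t.id
WHERE o.parent_id IS NULL;
```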


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] big joins not converging

2011-03-11 Thread fork
Dan Ancona da at vizbang.com writes:

  [T]his is a choice between developing some in-house capacity for this 
 and sending people to various vendors, so we'll probably lean on the 
 vendors for now, at least while we work on it. 

I would try to do the record matching in house and see how far you get, even if
you are talking to vendors concurrently.  You might get lucky, and you will
learn a lot about your data and how much to expect and pay for vendor solutions.

I would: try building multi-column indices on both tables on what you think are
the same fields, and match deterministically (if you have a key like a Social
Security number, do this again on full names).  Examine your data to see what
hits, what misses, and what hits multiple times.  If you know one table is good
and one is iffy, you can use a left outer join; otherwise you need a full outer
join.  Then put all the leftovers from each side into new tables, and try again
with something fuzzy.

If you build the indices and use = and it is still slow, ask again here --
that shouldn't happen.
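A sketch of that deterministic first pass (table and column names here are hypothetical stand-ins for your real keys):

```sql
-- Multi-column indices on the candidate match keys.
CREATE INDEX a_match_idx ON table_a (ssn, last_name, first_name);
CREATE INDEX b_match_idx ON table_b (ssn, last_name, first_name);

-- Full outer join: misses show up as NULLs on one side,
-- multiple hits as repeated ids on one side.
SELECT a.id AS a_id, b.id AS b_id
FROM table_a a
FULL OUTER JOIN table_b b
  ON a.ssn = b.ssn
 AND a.last_name = b.last_name
 AND a.first_name = b.first_name;

-- Leftovers from one side, saved into a new table for the fuzzy pass.
CREATE TABLE a_leftover AS
SELECT a.* FROM table_a a
WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.ssn = a.ssn);
```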

 And you're right fork, Record Linkage is in fact an entire academic  
 discipline!

Indeed.  Look into blocking and editing for your data first, I think.
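The fuzzy second pass can also live outside the database. Here is a minimal sketch in Python using the standard library's difflib; the blocking key (first letter), the 0.85 threshold, and all the names are illustrative choices, not a recommendation:

```python
import difflib

def block_key(name):
    """Crude blocking: bucket records by the first letter of the name."""
    return name.strip().lower()[:1]

def fuzzy_match(left, right, threshold=0.85):
    """Match leftover names block-by-block using difflib similarity."""
    blocks = {}
    for r in right:
        blocks.setdefault(block_key(r), []).append(r)
    matches = []
    for l in left:
        best, best_score = None, threshold
        # Only compare within the same block, not against everything.
        for r in blocks.get(block_key(l), []):
            score = difflib.SequenceMatcher(None, l.lower(), r.lower()).ratio()
            if score >= best_score:
                best, best_score = r, score
        if best is not None:
            matches.append((l, best))
    return matches
```

For example, `fuzzy_match(["Jon Smith"], ["John Smith", "Jane Doe"])` pairs "Jon Smith" with "John Smith", while names below the threshold are left unmatched for manual review.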

I find this problem pretty interesting, so I would love to hear your results.  I
am right now matching building permits to assessor parcels... I wish I were
using PG ...




Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Maciek Sakrejda
 I know the table has to be visited at some point to serve MVCC, but why 
 so early? Can NOT EXISTS only be resolved from the table, because it could 
 miss something otherwise?

Possibly because the index entries you're anti-joining against may
point to deleted tuples, so you would erroneously omit rows from the
join result if you skip the visibility check?

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com



Re: [PERFORM] Table partitioning problem

2011-03-11 Thread Robert Haas
On Thu, Mar 10, 2011 at 3:25 AM, Samba GUEYE samba.gu...@intesens.com wrote:
 The measure insertion is successful, but problems arise when inserting into
 the simpleMeasure table, because it can't find the foreign key inserted into
 the measure table and does not look at the partitioned tables.

Yes, that's how it works.
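To spell out the limitation with a minimal sketch (table names are hypothetical, modeled on the thread): a foreign key checks only the referenced parent table itself, never rows stored in its inheritance children, so references to partitioned rows fail:

```sql
CREATE TABLE measure (id int PRIMARY KEY);
CREATE TABLE measure_2011 () INHERITS (measure);

CREATE TABLE simple_measure (
    id         int PRIMARY KEY,
    measure_id int REFERENCES measure (id)
);

INSERT INTO measure_2011 VALUES (1);
-- Fails with an FK violation: the RI check scans ONLY measure,
-- so the row living in measure_2011 is invisible to it.
INSERT INTO simple_measure VALUES (1, 1);
```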

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-11 Thread fork
Marti Raudsepp marti at juffo.org writes:

 If you don't mind long recovery times in case of a crash, set
 checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
 will improve write throughput significantly.

Sounds good.
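As a postgresql.conf fragment, for the record (these are the quoted bulk-load values; revert them once the load is done):

```
checkpoint_segments = 100
checkpoint_completion_target = 0.9
```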

 Also, if you don't mind CORRUPTing your database after a crash,
 setting fsync=off and full_page_writes=off gives another significant
 boost.

I probably won't do this... ;)

 UPDATE on a table with many indexes will probably be slower. If you
 want to speed up this part, use INSERT INTO x SELECT and take this
 chance to partition your table, 

Like the following?  Will it rebuild the indexes in a sensible way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM bar;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;  -- type must match foo + bar
INSERT INTO foo SELECT * FROM tempfoo;      -- column order must line up
DROP TABLE tempfoo;
END;

 such that each individual partition
 and most indexes will fit in your cache. 

Is there a rule of thumb on the tradeoffs in a partitioned table?  About half
the time I will want to do GROUP BYs that use the partition column, but about
half the time I won't.  (I would use as the partition column whatever I am most
likely to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc.).  Often a query would be SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc., but often it would be GROUP BY state or whatever, with no
age5 component.
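A minimal sketch of that layout using inheritance-based partitioning (the tab00/tab05 names follow the example above; columns are made up, and only two of the 20 children are shown):

```sql
-- Parent table plus one child per 5-year bucket.
CREATE TABLE tab (age5 int, state text, val numeric);
CREATE TABLE tab00 (CHECK (age5 >= 0 AND age5 < 5 )) INHERITS (tab);
CREATE TABLE tab05 (CHECK (age5 >= 5 AND age5 < 10)) INHERITS (tab);

-- With constraint exclusion on, a query that filters on age5 scans
-- only the matching children; GROUP BY state alone scans all of them.
SET constraint_exclusion = partition;
SELECT state, count(*) FROM tab WHERE age5 < 5 GROUP BY state;
```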

I know I can experiment ;), but it takes a while to load anything, and I would
rather stand on the shoulders of giants.

Thanks so much for all your help!






Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread hans wulf
Thanks for the answer.

so there's no way around this problem? A nice index bitmap merge would be 
super fast. Big-table ANTI JOIN queries with only a few expected results are 
totally broken, if this is true. 

This query breaks my neck. This is a massive downside of Postgres, which 
makes this kind of query impossible. MySQL gives you the answer in a few 
seconds :-(



 Possibly because the index entries you're anti-joining against may
 point to deleted tuples, so you would erroneously omit rows from the
 join result if you skip the visibility check?
 
 ---
 Maciek Sakrejda | System Architect | Truviso
 
 1065 E. Hillsdale Blvd., Suite 215
 Foster City, CA 94404
 (650) 242-3500 Main
 www.truviso.com




Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Kenneth Marshall
On Fri, Mar 11, 2011 at 06:54:39PM +0100, hans wulf wrote:
 Thanks for the answer.
 
 so there's no way around this problem? A nice index bitmap merge would 
 be super fast. Big-table ANTI JOIN queries with only a few expected 
 results are totally broken, if this is true. 
 
 This query breaks my neck. This is a massive downside of Postgres, 
 which makes this kind of query impossible. MySQL gives you the answer 
 in a few seconds :-(
 
 

Super! I am glad that MySQL can meet your needs. No software is
perfect, and you should definitely choose based on your use case.

Regards,
Ken



Re: [PERFORM] ANTI-JOIN needs table, index scan not possible?

2011-03-11 Thread Kevin Grittner
Kenneth Marshall k...@rice.edu wrote:
 On Fri, Mar 11, 2011 at 06:54:39PM +0100, hans wulf wrote:
 so there's no way around this problem? A nice index bitmap merge
 thing would be super fast. Big-table ANTI JOIN queries with only
 a few expected results are totally broken, if this is true. 
 
 This query breaks my neck. This is a massive downside of
 Postgres, which makes this kind of query impossible. MySQL gives
 you the answer in a few seconds :-(
 
 Super! I am glad that MySQL can meet your needs. No software is
 perfect, and you should definitely choose based on your use case.
 
Well, as far as I can see, we haven't yet seen nearly enough
information to diagnose the issue, suggest alternative ways to write
the query which might perform better, or determine whether there's
an opportunity to improve the optimizer here.
 
Hans, please read this page and provide more detail:
 
http://wiki.postgresql.org/wiki/SlowQueryQuestions
 
-Kevin
