Re: [PERFORM] big joins not converging
Dan Ancona <da at vizbang.com> writes:
> [T]his is a choice between developing some in-house capacity for this
> and sending people to various vendors, so we'll probably lean on the
> vendors for now, at least while we work on it.

I would try to do the record matching in house and see how far you get, even if you are talking to vendors concurrently. You might get lucky, and you will learn a lot about your data and about how much to expect from (and pay for) vendor solutions.

I would:

- Build multi-column indices on both tables over the fields you think identify the same rows, and match deterministically. (If you have a key like a Social Security number, do this first; then do it again on full names.)
- Examine your data to see what hits, what misses, and what hits multiple times.
- If you know one table is good and the other is iffy, you can use a left outer join; otherwise you need a full outer join.
- Put all the leftovers from each side into new tables, and try again with something fuzzy.

If you build the indices, use =, and it is still slow, ask again here -- that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic
> discipline!

Indeed. Look into blocking and editing with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results. I am right now matching building permits to assessor parcels; I wish I was using PG...

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
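A minimal sketch of that deterministic pass and the leftover tables (all names here -- voters_a, voters_b, countyname, voter_id -- are stand-ins, not from the original thread):

```sql
-- Deterministic pass: index both sides on the candidate key, then join with =.
CREATE INDEX idx_voters_a_key ON voters_a (countyname, voter_id);
CREATE INDEX idx_voters_b_key ON voters_b (countyname, voter_id);

-- Full outer join so misses on either side stay visible.
CREATE TABLE matched AS
SELECT a.*, b.voter_id AS b_voter_id
FROM voters_a a
FULL OUTER JOIN voters_b b
  ON a.countyname = b.countyname
 AND a.voter_id   = b.voter_id;

-- Leftovers from one side, ready for a fuzzier second pass
-- (repeat with the roles swapped for leftover_b).
CREATE TABLE leftover_a AS
SELECT a.*
FROM voters_a a
WHERE NOT EXISTS (
  SELECT 1 FROM voters_b b
  WHERE b.countyname = a.countyname
    AND b.voter_id   = a.voter_id);
```

Because each pass shrinks the leftover tables, the expensive fuzzy comparisons only run on the rows the cheap equality join could not settle.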
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
Marti Raudsepp <marti at juffo.org> writes:
> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
> will improve write throughput significantly.

Sounds good.

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table,

Like the following? Will it rebuild the indexes in a sensible way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;  -- type to match foo + bar
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition and most indexes will fit in your
> cache.

Is there a rule of thumb on the tradeoffs in a partitioned table? About half the time I will want to do GROUP BY's that use the partition column, but about half the time I won't. (I would use as the partition column whatever I am most likely to cluster by in a single big table, right?) For example, I might intuitively partition by age5 (into 20 tables like tab00, tab05, tab10, etc.). Often a query would be SELECT ... FROM parenttable GROUP BY age5, race, etc., but often it would be GROUP BY state or whatever, with no age5 component.

I know I can experiment ;), but it takes a while to load anything, and I would rather stand on the shoulders of others.

Thanks so much for all your help!
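For reference, the partitioning being suggested looks roughly like this in PostgreSQL's inheritance-style scheme of that era (table and column names are illustrative only; newer PostgreSQL offers declarative PARTITION BY RANGE for the same job):

```sql
-- Empty parent; children carry CHECK constraints so the planner
-- (with constraint_exclusion = partition, the default) can skip
-- partitions that cannot match the WHERE clause.
CREATE TABLE pums (age5 int, race text, state text, weight numeric);

CREATE TABLE pums_00 (CHECK (age5 >= 0 AND age5 < 5))  INHERITS (pums);
CREATE TABLE pums_05 (CHECK (age5 >= 5 AND age5 < 10)) INHERITS (pums);
-- ... one child per five-year band ...

-- Indexes are per-child, so each one is small enough to stay cached.
CREATE INDEX pums_00_age_race ON pums_00 (age5, race);
CREATE INDEX pums_05_age_race ON pums_05 (age5, race);

-- A query filtered on the partition column touches only pums_00:
--   SELECT race, sum(weight) FROM pums WHERE age5 < 5 GROUP BY race;
```

A GROUP BY that does not mention age5 still works against the parent table, but it has to scan every child; the partitioning only pays off for queries that can be pruned on the partition column.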
[PERFORM] Tuning massive UPDATES and GROUP BY's?
Given that a massive UPDATE ... SET foo = bar || ' ' || baz; on a 12-million-row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS) is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it is currently only using a little memory and CPU (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier, it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled back, that wouldn't be the end of the world. So I don't know if I can use dangerous settings for WAL, checkpoints, etc. There also aren't a lot of concurrent hits on the DB, though there are a few.

I am loath to create a new table from a select, since the indexes themselves take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would love to speed these up, mostly just for my own impatience...
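For concreteness, the statement and setting under discussion look something like this (the table name pums_table is a stand-in; the 2048MB value is from the post, not a recommendation):

```sql
-- Session-level, so postgresql.conf stays untouched; only helps
-- operations that sort or hash (the GROUP BY's), not the UPDATE itself.
SET work_mem = '2048MB';

-- The massive update: rewrites every one of the ~12M rows once,
-- which is I/O-bound, hence the low CPU and memory use observed.
UPDATE pums_table SET foo = bar || ' ' || baz;
```

The low top numbers are expected here: a full-table UPDATE is dominated by writing new row versions and index entries, so work_mem barely matters for it.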
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
Merlin Moncure <mmoncure at gmail.com> writes:
>> I am loath to create a new table from a select, since the indexes
>> themselves take a really long time to build.
>
> You are aware that updating the field for the entire table, especially
> if there is an index on it (or any field being updated), will cause
> all your indexes to be rebuilt anyway? When you update a record, it
> gets a new position in the table, and a new index entry with that
> position. insert/select to temp + truncate + insert/select back is
> usually going to be faster and will save you the reindex/cluster.
> OTOH, if you have foreign keys it can be a headache.

Hmph. I guess I will have to find a way to automate it, since there will be a lot of times I want to do this.

>> As the title alludes, I will also be doing GROUP BY's on the data,
>> and would love to speed these up, mostly just for my own impatience...
>
> Need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines for a DB that is really only for batch analysis versus transaction processing -- like putting all your temp tables on a different disk, or whatever. I will post specifics later.
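The temp-table shuffle Merlin describes, sketched out (all names -- big_table, id, foo, bar, baz -- are stand-ins for the real schema):

```sql
BEGIN;
-- Copy out, applying the transformation on the way:
CREATE TEMP TABLE scratch AS
  SELECT id, bar || ' ' || baz AS foo, bar, baz FROM big_table;

TRUNCATE big_table;   -- instant, unlike a row-by-row DELETE

-- (Dropping big_table's indexes here and recreating them after the
--  INSERT makes the reload faster still, at the cost of more steps.)
INSERT INTO big_table (id, foo, bar, baz)
  SELECT id, foo, bar, baz FROM scratch;

DROP TABLE scratch;
COMMIT;
```

Unlike the UPDATE, this writes each row once instead of leaving dead row versions behind, so no VACUUM or reindex of bloated indexes follows; foreign keys pointing at big_table will block the TRUNCATE, which is the headache Merlin mentions.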
Re: [PERFORM] big joins not converging
Steve Atkins <steve at blighty.com> writes:
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>> Hi postgressers - As part of my work with voter file data, I pretty
>> regularly have to join one large-ish (over 500k rows) table to
>> another. Sometimes this is via a text field (countyname) + integer
>> (voter id). I've noticed sometimes this converges and sometimes it
>> doesn't, seemingly regardless of how I index things.

By "converge" you mean finish running? "Converge" has a lot of other overtones for us amateur math types.

Note that I think you are doing record linkage, which is a stepchild academic discipline of its own these days. It might bear some research. There is also a CDC matching program for text files, freely downloadable for Windows (ack), if you hunt for it.

For now, my first thought is that you should try a few different matches, maybe via PL/pgSQL functions, cascading the non-hits to the next step in the process while shrinking your tables: upcase and delete all spaces, etc. First use equality on all columns, which should be able to use indices, and separate out those records. Then try equality on a few columns. Then try some super-fuzzy regexes on a few columns. Etc.

You will also have to give some thought to scoring a match: a perfect match scores a one, but, say, name and birthday the same with all else different might be a .75, etc. Also, soundex(), levenshtein(), and other fuzzy string tools are your friends. I want to write a version of SAS's COMPGED for Postgres, but I haven't got round to it yet.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
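A sketch of the fuzzy step using PostgreSQL's fuzzystrmatch contrib module (table and column names -- leftover_a, leftover_b, lastname, firstname, countyname -- are stand-ins; on older installations the module is loaded via its contrib SQL script rather than CREATE EXTENSION):

```sql
-- fuzzystrmatch ships in PostgreSQL's contrib modules.
CREATE EXTENSION fuzzystrmatch;

SELECT a.voter_id, b.voter_id AS b_voter_id,
       levenshtein(upper(a.lastname), upper(b.lastname)) AS lname_dist,
       soundex(a.firstname) = soundex(b.firstname)       AS fname_alike
FROM leftover_a a
JOIN leftover_b b
  ON a.countyname = b.countyname   -- "blocking": only compare within a county
WHERE levenshtein(upper(a.lastname), upper(b.lastname)) <= 2;
```

The equality join on countyname is the blocking step: without it, levenshtein() would run on every cross-pair of the two leftover tables, which is quadratic and hopeless at 500k+ rows.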