Re: [PERFORM] big joins not converging

2011-03-11 Thread fork
Dan Ancona da at vizbang.com writes:

  [T]his is a choice between  
 developing some in-house capacity for this and sending people to  
 various vendors so we'll probably lean on the vendors for now, at  
 least while we work on it. 

I would try to do the record matching in house and see how far you get, even if
you are talking to vendors concurrently.  You might get lucky, and you will
learn a lot about your data and how much to expect and pay for vendor solutions.

I would try building multi-column indices on both tables over the columns you
think identify the same rows, and match deterministically first (if you have a
key like a Social Security number, do that, then do it again on full names).
Examine your data to see what hits, what misses, and what hits multiple rows.
If you know one table is good and the other is iffy, you can use a left outer
join; otherwise you need a full outer join.  Then put the leftovers from each
table into new tables and try again with something fuzzy -- a sketch follows.
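Untested sketch of the deterministic pass, with made-up table and column names
(table_a, table_b, id, countyname, voter_id):

CREATE INDEX table_a_match_idx ON table_a (countyname, voter_id);
CREATE INDEX table_b_match_idx ON table_b (countyname, voter_id);

-- See what hits, what misses, and what hits multiple rows.
SELECT a.id, count(b.id) AS n_matches
FROM table_a a
LEFT JOIN table_b b
  ON a.countyname = b.countyname AND a.voter_id = b.voter_id
GROUP BY a.id;

-- Leftovers from the good table go into a new table for the fuzzy pass.
CREATE TABLE table_a_leftover AS
SELECT a.*
FROM table_a a
LEFT JOIN table_b b
  ON a.countyname = b.countyname AND a.voter_id = b.voter_id
WHERE b.voter_id IS NULL;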

If you build the indices and use = and it is still slow, ask again here --
that shouldn't happen.

 And you're right fork, Record Linkage is in fact an entire academic  
 discipline!

Indeed.  Look into "blocking" and "editing" for your data first, I think.

I find this problem pretty interesting, so I would love to hear your results.  I
am right now matching building permits to assessor parcels -- I wish I were
using PG ...




Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-11 Thread fork
Marti Raudsepp marti at juffo.org writes:

 If you don't mind long recovery times in case of a crash, set
 checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
 will improve write throughput significantly.

Sounds good.
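For reference, I take it that means something like this in postgresql.conf
(with a reload afterwards; long crash recovery is the tradeoff):

# settings suggested above
checkpoint_segments = 100
checkpoint_completion_target = 0.9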

 Also, if you don't mind CORRUPTing your database after a crash,
 setting fsync=off and full_page_writes=off gives another significant
 boost.

I probably won't do this... ;)

 UPDATE on a table with many indexes will probably be slower. If you
 want to speed up this part, use INSERT INTO x SELECT and take this
 chance to partition your table, 

Like the following?  Will it rebuild the indexes in a sensible way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;  -- or whatever type foo + bar yields
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

 such that each individual partition
 and most indexes will fit in your cache. 

Is there a rule of thumb on tradeoffs in a partitioned table?  About half the
time, I will want to do GROUP BY's that use the partition column, but about half
the time I won't.  (I would use as the partition column whatever I am most
likely to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc). Often a query would be SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc, but often it would be GROUP BY state or whatever with no
age5 component.
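Just so I'm picturing it right, a rough, untested sketch using inheritance and
CHECK constraints (names made up, most of the ~100 columns omitted):

CREATE TABLE pums (age5 int, race text, state text);  -- plus the other columns
CREATE TABLE pums_00 (CHECK (age5 >= 0 AND age5 < 5)) INHERITS (pums);
CREATE TABLE pums_05 (CHECK (age5 >= 5 AND age5 < 10)) INHERITS (pums);
-- ... one child per five-year bucket, each with its own indexes ...
CREATE INDEX pums_00_age5_race_idx ON pums_00 (age5, race);
-- With constraint_exclusion = partition, a WHERE on age5 lets the planner skip
-- children; a GROUP BY state over the parent still scans everything, but each
-- child and its indexes should fit in cache.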

I know I can experiment ;), but it takes a while to load anything, and I would
rather stand on the shoulders of giants.

Thanks so much for all your help!






[PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it currently is only using a little bit of memory
and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled
back, that wouldn't be the end of the world.  So I don't know if I can use
dangerous settings for WAL, checkpoints, etc.  There also aren't a lot of
concurrent hits on the DB, though there are a few.

I am loathe to create a new table from a select, since the indexes themselves
take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would
love to speed these up, mostly just for my own impatience...  






Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Merlin Moncure mmoncure at gmail.com writes:

  I am loathe to create a new table from a select, since the indexes 
  themselves
  take a really long time to build.
 
 you are aware that updating the field for the entire table, especially
 if there is an index on it (or any field being updated), will cause
 all your indexes to be rebuilt anyways?  when you update a record, it
 gets a new position in the table, and a new index entry with that
 position.  
 insert/select to temp, + truncate + insert/select back is
 usually going to be faster and will save you the reindex/cluster.
 otoh, if you have foreign keys it can be a headache.

Hmph.  I guess I will have to find a way to automate it, since there will be a
lot of times I want to do this. 
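Maybe something quick and dirty like this (untested; real code should use
quote_ident() and worry about constraints and foreign keys):

CREATE OR REPLACE FUNCTION add_computed_col(tbl text, col text, typ text, expr text)
RETURNS void AS $$
BEGIN
  EXECUTE 'CREATE TEMP TABLE rebuild_tmp AS SELECT *, ' || expr || ' AS ' || col
          || ' FROM ' || tbl;
  EXECUTE 'TRUNCATE ' || tbl;
  EXECUTE 'ALTER TABLE ' || tbl || ' ADD COLUMN ' || col || ' ' || typ;
  EXECUTE 'INSERT INTO ' || tbl || ' SELECT * FROM rebuild_tmp';
  EXECUTE 'DROP TABLE rebuild_tmp';
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT add_computed_col('foo', 'newcol', 'text', $$bar || ' ' || baz$$);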

  As the title alludes, I will also be doing GROUP BY's on the data, and would
  love to speed these up, mostly just for my own impatience...
 
 need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is
really only for batch analysis versus transaction processing.  Like put all
your temp tables on a different disk or whatever.  I will post specifics later.
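By the temp-tables-on-another-disk thing I assume you mean something like this
(path made up; the directory has to exist and be owned by the postgres user):

CREATE TABLESPACE scratch LOCATION '/mnt/otherdisk/pg_scratch';
SET temp_tablespaces = 'scratch';  -- per session, or set it in postgresql.conf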




Re: [PERFORM] big joins not converging

2011-03-10 Thread fork
Steve Atkins steve at blighty.com writes:

 
 
 On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
 
  Hi postgressers -
  
 As part of my work with voter file data, I pretty regularly have to join
 one large-ish (over 500k rows) table to another. Sometimes this is via a
 text field (countyname) + integer (voter id). I've noticed sometimes this
 converges and sometimes it doesn't, seemingly regardless of how I index
 things.

By "converge" you mean "finish running", I take it -- "converge" has a lot of
other overtones for us amateur math types.

Note that I think you are doing "record linkage", which is an academic
stepchild of its own these days.  It might bear some research.  There is also a
CDC matching program for text files, freely downloadable for Windows (ack), if
you hunt for it.

For now, my first thought is that you should try a few different matches, maybe
via PL/pgSQL functions, cascading the non-hits to the next step in the process
while shrinking your tables.  Normalize first (upcase, delete all spaces, etc.).
Then use equality on all columns, which should be able to use indices, and
separate out those records.  Then try equality on just a few columns.  Then try
some super fuzzy regexes on a few columns.  Etc.

You will also have to give some thought to scoring a match: a perfect match is
a 1, while, say, name and birthday the same with everything else different
might be a .75, etc.

Also, soundex(), levenshtein(), and other fuzzy string tools are your friends.
I want to write a version of SAS's COMPGED for Postgres, but I haven't gotten
around to it yet.
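(Both soundex() and levenshtein() come from the fuzzystrmatch contrib module.)
A rough, untested sketch of a fuzzy pass over the leftover tables, with made-up
table and column names:

-- install the fuzzystrmatch contrib module first
SELECT a.id AS a_id, b.id AS b_id,
       levenshtein(upper(a.fullname), upper(b.fullname)) AS name_dist
FROM leftovers_a a
JOIN leftovers_b b ON soundex(a.lastname) = soundex(b.lastname)
WHERE levenshtein(upper(a.fullname), upper(b.fullname)) <= 3;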

