Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Any sane text search application is going to try to filter out
 common words as stopwords; it's only the failure to do that that's
 making this run slow.
  
 I'd rather have the index used for the selective test, and apply
 the remaining tests to the rows retrieved from the heap.
 
 Uh, that was exactly my point.  Indexing common words is a waste.
 
Perhaps I'm missing something.  My point was that there are words
which are too common to be useful for index searches, yet uncommon
enough to usefully limit the results.  These words could typically
benefit from tsearch2 style parsing and dictionaries; so declaring
them as stop words would be bad from a functional perspective, yet
searching an index for them would be bad from a performance
perspective.
 
One solution would be for the users to rigorously identify all of
these words, include them on one stop word list but not another,
include *two* tsvector columns in the table (with and without the
iffy words), index only the one with the larger stop word list, and
generate two tsquery values to search the two different columns.  Best
of both worlds.  Sort of.  The staff time to create and maintain such
a list would obviously be costly and writing the queries would be
error-prone.
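
One way such a setup could look (a rough sketch only; the table, the columns,
and the "aggressive_stop" text search configuration are invented for
illustration) is to index only the column built with the larger stop word
list:

  ALTER TABLE docs ADD COLUMN tsv_full tsvector;  -- small stop word list, keeps the iffy words
  ALTER TABLE docs ADD COLUMN tsv_lean tsvector;  -- large stop word list, iffy words removed
  UPDATE docs SET tsv_full = to_tsvector('english',         body),
                  tsv_lean = to_tsvector('aggressive_stop', body);
  CREATE INDEX docs_tsv_lean_gin ON docs USING gin (tsv_lean);

  -- the indexed, stripped column narrows the candidate rows; the full column
  -- applies the iffy words only to the rows actually fetched
  SELECT id FROM docs
  WHERE tsv_lean @@ to_tsquery('aggressive_stop', 'rareterm')
    AND tsv_full @@ to_tsquery('english', 'rareterm & iffyterm');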
 
Second best would be to somehow recognize the iffy words and exclude
them from the index and the index search phase, but apply the check
when the row is retrieved from the heap.  I really have a hard time
seeing how the conditional exclusion from the index could be
accomplished, though.  Next best would be to let them fall into the
index, but exclude top level ANDed values from the index search,
applying them only to the recheck when the row is read from the heap. 
That seems, at least conceptually, like it could be done.
 
-Kevin



Re: [PERFORM] Problem with database performance, Debian 4gb ram ?

2009-11-03 Thread Kevin Grittner
Grant Masan grant.mas...@gmail.com wrote:
 
 max_connections = 80
 shared_buffers = 512MB
 temp_buffers = 8MB
 work_mem = 20MB
 maintenance_work_mem = 384MB
 wal_buffers = 8MB
 checkpoint_segments = 128MB
 effective_cache_size = 2304MB
 checkpoint_timeout = 1h
 
Pending further information, these seem sane to me.
 
 cpu_tuple_cost = 0.0030
 cpu_index_tuple_cost = 0.0010
 cpu_operator_cost = 0.0005
 
Why did you make these adjustments?  I usually have to change the
ratio between page and cpu costs toward the other direction.  Unless
you have carefully measured performance with and without these changes
and found a clear win with these, I would recommend going back to the
defaults for these three and tuning from there.
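
For reference, the stock defaults, as they appear in postgresql.conf, are:

  #cpu_tuple_cost = 0.01
  #cpu_index_tuple_cost = 0.005
  #cpu_operator_cost = 0.0025

so the settings above are several times lower than what the planner normally
assumes.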
 
 fsync = off
 
Only use this if you can afford to lose all data in the database. 
(There are some situations where this is actually OK, but they are
unusual.)
 
As others have stated, though, we'd need more information to really
give much useful advice.  An EXPLAIN ANALYZE of a query which isn't
performing to expectations would be helpful, especially if you include
the table definitions (with indexes) of all tables involved in the
query.  Output from vmstat or iostat with a fairly small interval (I
usually use 1) while the query is running would be useful, too.
 
Knowing the exact version of PostgreSQL (like from SELECT version();)
would be useful, as well as knowing more about your disk array and
controller(s).
 
-Kevin



Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Perhaps I'm missing something.  My point was that there are words
 which are too common to be useful for index searches, yet uncommon
 enough to usefully limit the results.  These words could typically
 benefit from tsearch2 style parsing and dictionaries; so declaring
 them as stop words would be bad from a functional perspective, yet
 searching an index for them would be bad from a performance
 perspective.

Right, but the original complaint in this thread was that a GIN index is
slow about searching for very common terms.  The answer to that clearly
is to not index common terms, rather than worry about making the case
a bit faster.

It may well be that Jesper's identified a place where the GIN code could
be improved --- it seems like having the top-level search logic be more
aware of the AND/OR structure of queries would be useful.  But the
particular example shown here doesn't make a very good case for that,
because it's hard to tell how much of a penalty would be taken in more
realistic examples.

regards, tom lane



Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 The answer to that clearly is to not index common terms
 
My understanding is that we don't currently get statistics on how
common the terms in a tsvector column are until we ANALYZE the *index*
created from it.  Seems like sort of a Catch 22.  Also, if we exclude
words which are in the tsvector from the index on the tsvector, we
need to know what words were excluded so we know not to search on them
as well as forcing the recheck of the full tsquery (unless this always
happens already?).
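
(For what it's worth, lexeme frequencies can be inspected directly with
ts_stat(), independent of the planner's statistics; it scans the whole column,
so it is expensive on large tables.  Assuming the table behind the plans below
is named ftsbody, something like

  SELECT word, ndoc, nentry
  FROM ts_stat('SELECT ftsbody_body_fts FROM ftsbody')
  ORDER BY ndoc DESC
  LIMIT 20;

would show which terms are common enough to worry about.)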
 
 It may well be that Jesper's identified a place where the GIN code
 could be improved
 
My naive assumption has been that it would be possible to get an
improvement without touching the index logic, by changing this part of
the query plan:
 
 Index Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
 
to something like this:
 
 Index Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1'::text))
 
and count on this doing the rest:
 
   Recheck Cond: (ftsbody_body_fts @@ to_tsquery
('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
 
I'm wondering if anyone has ever confirmed that probing for the more
frequent term through the index is *ever* a win, versus using the
index for the most common of the top level AND conditions and doing
the rest on recheck.  That seems like a dangerous assumption from
which to start.
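
In the meantime the effect can be approximated by hand, letting the selective
term drive the index probe and repeating the full condition so it is applied
to the rows actually fetched -- essentially the split form Jesper benchmarked.
A sketch only, assuming the table behind the plan above is named ftsbody;
whether the planner actually keeps the second clause out of the index
condition would need to be verified with EXPLAIN:

  SELECT id FROM ftsbody
  WHERE ftsbody_body_fts @@ to_tsquery('TERM1')
    AND ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5');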
 
 But the particular example shown here doesn't make a very good case
 for that, because it's hard to tell how much of a penalty would be
 taken in more realistic examples.
 
Fair enough.  We're in the early stages of moving to tsearch2 and I
haven't run across this yet in practice.  If I do, I'll follow up.
 
-Kevin



Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 I'm wondering if anyone has ever confirmed that probing for the more
 frequent term through the index is *ever* a win, versus using the
 index for the most common of the top level AND conditions and doing
 the rest on recheck.
 
s/most/least/
 
-Kevin



Re: [PERFORM] Problem with database performance, Debian 4gb ram ?

2009-11-03 Thread Jeff Janes
On Tue, Nov 3, 2009 at 7:13 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Grant Masan grant.mas...@gmail.com wrote:


 cpu_tuple_cost = 0.0030
 cpu_index_tuple_cost = 0.0010
 cpu_operator_cost = 0.0005

 Why did you make these adjustments?  I usually have to change the
 ratio between page and cpu costs toward the other direction.

Is that because the database is mostly cached in memory?  If I take the
documented descriptions of the cost parameters at face value, I find
that cpu_tuple_cost should be even lower yet.


Cheers,

Jeff



[PERFORM] Free memory usage Sol10, 8.2.9

2009-11-03 Thread Subbiah Stalin-XCGF84
All,
 
I'm trying to understand the free memory usage and why it falls below
17G sometimes and what could be causing it. Any pointers would be
appreciated.
 
r...@prod1 # prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 32768 Megabytes
 
[postg...@prod1 ~]$ vmstat 5 10
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr 1m 1m 1m m1   in   sy   cs us sy id
 0 0 0 51713480 21130304 58 185 325 104 104 0 0 23 3 7 1 488 604  573  1 2 97
 0 0 0 51048768 18523456 6 10 0 192 192 0 0 4  0  3  0  527  753  807  2 1 97
 0 0 0 51713480 21130304 58 185 325 104 104 0 0 1 23 3 7 488 604  573  1 2 97
 0 0 0 51067112 18538472 0 1 0 171 171 0 0  4  8  0  4  522  573  740  2 1 97
 0 0 0 51072744 18542992 0 0 0 187 187 0 0  0 22  0  7  532  657  780  2 1 97
 0 0 0 51069944 18540736 146 1729 3 174 174 0 0 0 9 0 3 526 3227  944  4 5 91
 0 0 0 51065728 18537360 32 33 0 192 192 0 0 0 20 0  3  522 1147  927  3 2 95
 0 0 0 51065728 18537336 0 0 0 190 190 0 0  0 26  0  3  517  628  789  2 1 97
 0 0 0 51065728 18537336 0 0 0 168 168 0 0  0 25  0 11  517  668  810  2 2 96
 0 0 0 51062960 18535152 0 165 2 190 190 0 0 14 29 0 4  552  732  808  2 1 97
 
prstat -am
 
 NPROC USERNAME  SWAP   RSS MEMORY  TIME  CPU
21 postgres 8312M 8300M   25% 112:24:15 2.1%
53 root  347M  236M   0.7% 130:52:02 0.1%
 7 daemon708M  714M   2.2%  21:53:05 0.0%
 4 mot  5552K   15M   0.0%   0:00:00 0.0%
 1 smmsp1384K 5480K   0.0%   0:00:59 0.0%


[postg...@prod1]$ ps -eaf | grep postgres | wc -l
24
 
max_connections = 600
shared_buffers = 8000MB 
temp_buffers = 8MB
work_mem = 2MB  
maintenance_work_mem = 256MB
max_fsm_pages = 2048000
max_fsm_relations = 2000
effective_cache_size = 4000MB
 
Thanks,
Stalin

 
 


Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Jesper Krogh
Tom Lane wrote:
 It may well be that Jesper's identified a place where the GIN code could
 be improved --- it seems like having the top-level search logic be more
 aware of the AND/OR structure of queries would be useful.  But the
 particular example shown here doesn't make a very good case for that,
 because it's hard to tell how much of a penalty would be taken in more
 realistic examples.

With a term sitting in:
80% of the docs the penalty is: x23
60% of the docs the penalty is: x17
40% of the docs the penalty is: x13
for doing
vectorcol @@ to_tsquery('term & commonterm')
compared to
vectorcol @@ to_tsquery('term') and vectorcol @@ to_tsquery('commonterm');
where term is non-existent (or rare).

(These are query execution times on a fully memory-resident dataset; doing the
test with drop_caches and a restart of pg, so everything has to come off disk,
the numbers are a bit higher.)

http://article.gmane.org/gmane.comp.db.postgresql.performance/22496/match=

Would you ever qualify a term sitting in 60-80% of the docs as a stop-word candidate?

I don't know if x13 in execution performance is worth hunting, or whether there
is lower-hanging fruit sitting in the fts-search system.

This is essentially the penalty the user will get for adding a term to
their search that rarely restricts the results.

In terms of the usual set theory that databases work in, a search for a
stop-word should translate into the full set. That is just not what happens:
instead it throws a warning and returns the empty set. The warning can be
caught by application code to produce the correct result for the users, but
even slightly more complex queries don't raise it:

ftstest=# select id from ftstest where body_fts @@ to_tsquery('random |
the') limit 10;
 id
----
(0 rows)

Here I would have expected the same warning.  I basically have to hook the
complete stop-word dictionary into an FTS pre-parser to give the user the
expected results, or have I missed a feature somewhere?

My reason for not pushing commonterms into the stopword list is that
they actually perform excellently in PG.

Same body as usual, but commonterm99 is sitting in 99% of the documents.

ftstest=# set enable_seqscan=off;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ftstest  (cost=1051476.74..1107666.07 rows=197887 width=4) (actual time=51.036..121.348 rows=197951 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm99'::text))
   ->  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..1051427.26 rows=197887 width=0) (actual time=49.602..49.602 rows=197951 loops=1)
         Index Cond: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 147.350 ms
(5 rows)

ftstest=# set enable_seqscan=on;
SET
ftstest=# explain analyze select id from ftstest where body_fts @@
to_tsquery('commonterm99');
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on ftstest  (cost=0.00..56744.00 rows=197887 width=4) (actual time=0.086..7134.384 rows=197951 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm99'::text))
 Total runtime: 7194.182 ms
(3 rows)



So, in order to get the result with a speedup of more than x50, I simply
cannot add these terms to the stop-words, because then the first query would
resolve to a warning and getting results would then be up to the second
query.

My bet is that doing a seq_scan will never be beneficial for this type
of query.

As far as I can see, the only consequence of simply not removing stop-words
at all is a (fairly small) increase in index size. It seems to me that
stop-words were invented back when it was hard to get more than 2GB of memory
into a computer, to reduce the index to a size that could better fit into
memory. But nowadays it seems like the downsides are hard to see?

Jesper
-- 
Jesper



Re: [PERFORM] Optimizer + bind variables

2009-11-03 Thread Craig Ringer
David Kerr wrote:
 Does/is it possible for the PG optimizer to come up with different plans when 
 you're using bind variables vs when you send static values?

Yes, if the bind variable form causes your DB access driver to use a
server-side prepared statement. Pg can't use its statistics to improve
its query planning if it doesn't have a value for a parameter when it's
building the query plan.

Whether a server-side prepared statement is used or not depends on how
you're connecting to the database - ie your DB access driver and
version. If you're using JDBC, I *think* the JDBC driver does parameter
placement client-side unless you're using a JDBC prepared statement and
the JDBC prepared statement is re-used several times, at which point it
sets up a server-side prepared statement. AFAIK otherwise it uses
client-side (or Pg protocol level) parameter placement.

 if it's possible for the plan to be different how can i generate an
 xplan for the bind version?

xplan = explain? If so:

Use PREPARE to prepare a statement with the params, then use:

EXPLAIN EXECUTE prepared_statement_name(params);

eg:

x=> PREPARE blah AS SELECT * FROM generate_series(1,100);
PREPARE
x=> EXPLAIN EXECUTE blah;
                               QUERY PLAN
-------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=4)
(1 row)

--
Craig Ringer



[PERFORM] Optimizer + bind variables

2009-11-03 Thread David Kerr
Does/is it possible for the PG optimizer to come up with different plans when 
you're using bind variables vs when you send static values?

like  if my query was

select * from users (add a bunch of complex joins) where username = 'dave'
vs
select * from users (add a bunch of complex joins) where username = '?'

In Oracle they are frequently different.

if it's possible for the plan to be different how can i generate an
xplan for the bind version?

Thanks!

Dave



Re: [PERFORM] Free memory usage Sol10, 8.2.9

2009-11-03 Thread Jeremy Harris

On 11/03/2009 07:16 PM, Subbiah Stalin-XCGF84 wrote:

All,

I'm trying to understand the free memory usage and why it falls below
17G sometimes and what could be causing it. Any pointers would be
appreciated.

r...@prod1 # prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 32768 Megabytes

[postg...@prod1 ~]$ vmstat 5 10
  kthr      memory            page            disk          faults      cpu
  r b w   swap  free  re  mf pi po fr de sr 1m 1m 1m m1   in   sy   cs us sy id
  0 0 0 51713480 21130304 58 185 325 104 104 0 0 23 3 7 1 488 604  573  1 2 97
  0 0 0 51048768 18523456 6 10 0 192 192 0 0 4  0  3  0  527  753  807  2 1 97


Memory used by the OS for caching files is no longer free.
Free memory is wasted memory.
-J



Re: [PERFORM] maintaining a reference to a fetched row

2009-11-03 Thread Craig Ringer
Brian Karlak wrote:

 The setup is relatively simple: there is a central queue table in
 postgres.  Worker daemons do a bounded, ordered, limited SELECT to grab
 a row, which they lock by setting a value in the queue.status column. 

You can probably do an UPDATE ... RETURNING to turn that into one
operation - but that won't work with a cursor :-(

 My question is this: is there some way that I can keep a cursor /
 pointer / reference / whatever to the row I fetched originally, so that
 I don't have to search for it again when I'm ready to write results?

You could use a cursor, but it won't work if you're locking rows by
testing a 'status' flag, because that requires the worker to commit the
transaction (so others can see the status flag) before starting work. A
cursor only exists within a transaction.

BEGIN;
DECLARE curs CURSOR FOR SELECT * FROM queue ORDER BY queue_id LIMIT 1;
FETCH NEXT FROM curs;
--
-- Set the status - but nobody else can see the change yet because we
-- haven't committed! We'll have a Pg row lock on the record due to the
-- UPDATE, preventing other UPDATEs but not other SELECTs.
--
-- We can't start work until the transaction commits, but committing
-- will close the cursor.
--
UPDATE queue SET status = 1 WHERE CURRENT OF curs;


I don't have a good answer for you there. Perhaps using Pg's locking to
do your queueing, rather than updating a status flag, might let you use
a cursor? Have a look at the list archives - there's been a fair bit of
discussion of queuing mechanisms.

--
Craig Ringer



[PERFORM] High Frequency Inserts to Postgres Database vs Writing to a File

2009-11-03 Thread Jay Manni
Hi:

I have an application wherein a process needs to read data from a stream and 
store the records for further analysis and reporting. The data in the stream is 
in the form of variable length records with clearly defined fields - so it can 
be stored in a database or in a file. The only caveat is that the rate of 
records coming in the stream could be several 1000 records a second.

The design choice I am faced with currently is whether to use a postgres 
database or a flat file for this purpose. My application already maintains a 
postgres (8.3.4) database for other reasons - so it seemed like the 
straightforward thing to do. However I am concerned about the performance 
overhead of writing several 1000 records a second to the database. The same 
database is being used simultaneously for other activities as well and I do not 
want those to be adversely affected by this operation (especially the query 
times). The advantage of running complex queries to mine the data in various 
different ways is very appealing but the performance concerns are making me 
wonder if just using a flat file to store the data would be a better approach.

Anybody have any experience in high frequency writes to a postgres database?

- Jay


Re: [PERFORM] Optimizer + bind variables

2009-11-03 Thread David Kerr
On Wed, Nov 04, 2009 at 07:43:16AM +0800, Craig Ringer wrote:
- David Kerr wrote:
-  Does/is it possible for the PG optimizer to come up with different plans when 
-  you're using bind variables vs when you send static values?
- 
- Yes, if the bind variable form causes your DB access driver to use a
- server-side prepared statement. Pg can't use its statistics to improve
- its query planning if it doesn't have a value for a parameter when it's
- building the query plan.

hmm, that's a little unclear to me.

let's assume that the application is using prepare:

Assuming the database hasn't changed, would:
PREPARE bla1 (text) AS SELECT * FROM users WHERE username = $1;
EXPLAIN EXECUTE bla1('dave');

give the same output as
EXPLAIN SELECT * FROM users WHERE username = 'dave';

?

- Whether a server-side prepared statement is used or not depends on how
- you're connecting to the database - ie your DB access driver and
- version. If you're using JDBC, I *think* the JDBC driver does parameter
- placement client-side unless you're using a JDBC prepared statement and
- the JDBC prepared statement is re-used several times, at which point it
- sets up a server-side prepared statement. AFAIK otherwise it uses
- client-side (or Pg protocol level) parameter placement.
that's interesting, I'll need to find out which mine are using, probably
a mix of both.

-  if it's possible for the plan to be different how can i generate an
-  xplan for the bind version?
- 
- xplan = explain? If so:
yeah, sorry.

- Use PREPARE to prepare a statement with the params, then use:
- 
- EXPLAIN EXECUTE prepared_statement_name(params);
- 
- eg:
- 
- x= PREPARE blah AS SELECT * FROM generate_series(1,100);
- PREPARE
- x= EXPLAIN EXECUTE blah;
-QUERY PLAN
- 
-  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=4)
- (1 row)

great thanks!

Dave



Re: [PERFORM] maintaining a reference to a fetched row

2009-11-03 Thread Brian Karlak


On Nov 3, 2009, at 4:03 PM, Craig Ringer wrote:

I don't have a good answer for you there. Perhaps using Pg's locking to
do your queueing, rather than updating a status flag, might let you use
a cursor? Have a look at the list archives - there's been a fair bit of
discussion of queuing mechanisms.


This is an interesting idea.  I'll see what I can find in the archives.
It will likely take a bit of refactoring, but such is life ...


Thanks!
Brian



[PERFORM] maintaining a reference to a fetched row

2009-11-03 Thread Brian Karlak

Hello All --

I have a simple queuing application written on top of postgres which  
I'm trying to squeeze some more performance out of.


The setup is relatively simple: there is a central queue table in  
postgres.  Worker daemons do a bounded, ordered, limited SELECT to  
grab a row, which they lock by setting a value in the queue.status  
column.  When the task is complete, results are written back to the  
row.  The system is designed to allow multiple concurrent daemons to  
access a queue.  At any one time, we expect 1-5M active items on the  
queue.


Now this design is never going to win any performance awards against a  
true queuing system like Active/Rabbit/Zero MQ, but it's tolerably  
fast for our applications.  Fetch/mark times are about 1ms,  
independent of the number of items on the queue.  This is acceptable  
considering that our tasks take ~50ms to run.


However, the writing of results back to the row takes ~5ms, which is  
slower than I'd like.  It seems that this is because I need to do  
an index scan on the queue table to find the row I just fetched.


My question is this: is there some way that I can keep a cursor /  
pointer / reference / whatever to the row I fetched originally, so  
that I don't have to search for it again when I'm ready to write  
results?


Thanks in advance for any pointers you can provide.

Brian



Re: [PERFORM] Optimizer + bind variables

2009-11-03 Thread David Kerr
On Wed, Nov 04, 2009 at 11:02:22AM +1100, Chris wrote:
- David Kerr wrote:
- On Wed, Nov 04, 2009 at 07:43:16AM +0800, Craig Ringer wrote:
- - David Kerr wrote:
- No.
- 
- This is explained in the notes here:
- 
- http://www.postgresql.org/docs/current/static/sql-prepare.html

*sigh* and I've read that before too.

On the upside, then it behaves like I would expect it to, which is
good.

Thanks

Dave



Re: [PERFORM] High Frequency Inserts to Postgres Database vs Writing to a File

2009-11-03 Thread Scott Marlowe
On Tue, Nov 3, 2009 at 8:12 PM, Jay Manni jma...@fireeye.com wrote:
 Hi:



 I have an application wherein a process needs to read data from a stream and
 store the records for further analysis and reporting. The data in the stream
 is in the form of variable length records with clearly defined fields – so
 it can be stored in a database or in a file. The only caveat is that the
 rate of records coming in the stream could be several 1000 records a second.



 The design choice I am faced with currently is whether to use a postgres
 database or a flat file for this purpose. My application already maintains a

A common approach is to store them in flat files, then insert the flat
files at a later time so that if the db falls behind no data is lost.
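
For the bulk-load step, COPY is much cheaper than thousands of individual
INSERTs.  A sketch (file, table and column names are made up; use \copy from
psql if the file lives on the client rather than on the server):

  COPY stream_records (recorded_at, field1, field2)
  FROM '/var/spool/stream/batch-20091103.csv'
  WITH CSV;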



Re: [PERFORM] High Frequency Inserts to Postgres Database vs Writing to a File

2009-11-03 Thread David Saracini
could be several 1000 records a second.

So, are there periods when there are no/few records coming in?  Do the 
records/data/files really need to be persisted?  

The following statement makes me think you should go the flat file route:

The advantage of running complex queries to mine the data in various different 
ways is very appealing

Please don't be offended, but that sounds a little like feature creep.  I've 
found that it's best to keep it simple and don't do a bunch of work now for 
what might be requested in the future.

I know it's not exactly what you were looking for...  Just food for thought.

Best of luck!

David




From: Jay Manni jma...@fireeye.com
To: pgsql-performance@postgresql.org pgsql-performance@postgresql.org
Sent: Tue, November 3, 2009 7:12:29 PM
Subject: [PERFORM] High Frequency Inserts to Postgres Database vs Writing to a 
File

 
Hi:
 
I have an application wherein a process needs to read data from a stream and
store the records for further analysis and reporting. The data in the stream
is in the form of variable length records with clearly defined fields – so it
can be stored in a database or in a file. The only caveat is that the rate of
records coming in the stream could be several 1000 records a second.

The design choice I am faced with currently is whether to use a postgres
database or a flat file for this purpose. My application already maintains a
postgres (8.3.4) database for other reasons – so it seemed like the
straightforward thing to do. However I am concerned about the performance
overhead of writing several 1000 records a second to the database. The same
database is being used simultaneously for other activities as well and I do
not want those to be adversely affected by this operation (especially the
query times). The advantage of running complex queries to mine the data in
various different ways is very appealing but the performance concerns are
making me wonder if just using a flat file to store the data would be a
better approach.

Anybody have any experience in high frequency writes to a postgres database?

- Jay

Re: [PERFORM] Queryplan within FTS/GIN index -search.

2009-11-03 Thread Kevin Grittner
I wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 
 But the particular example shown here doesn't make a very good case
 for that, because it's hard to tell how much of a penalty would be
 taken in more realistic examples.
  
 Fair enough.  We're in the early stages of moving to tsearch2 and I
 haven't run across this yet in practice.  If I do, I'll follow up.
 
We have a staging database which allowed some limited testing quickly.
While it's real production data, we haven't been gathering this type
of data long, so it's got relatively few rows; therefore, it wasn't
feasible to try any tests which would be disk-bound, so I primed the
cache for all of these, and they are all totally served from cache. 
For various reasons which I'll omit unless asked, we do our text
searches through functions which take a selection string, turn it
into a tsquery with a little extra massaging on our part, run the
query with a minimum ranking to return, and return a set of records
ordered by the ranking in descending sequence.
 
Under these conditions there is a slight performance gain in adding an
additional test which matches 1356 out of 1691 rows.  Not surprisingly
for a fully cached query set, timings were very consistent from run to
run.  While undoubtedly a little unusual in approach, this is
production software run against real-world data.  I confirmed that it
is using the GIN index on the tsvector for these runs.
 
By the way, the tsearch2 features have been received very well so
far.  One of the first reactions from most users is surprise at how
fast it is.  :-)  Anyway, our production results don't confirm the
issue shown with the artificial test data.
 
 
scca=> select count(*) from DocThumbnail where text is not null;
 count
---
  1691
(1 row)

Time: 0.619 ms


scca=> select count(*) from (select DocThumbnail_text_rank('guardian
ad litem', 0.1)) x;
 count
---
41
(1 row)

Time: 19.394 ms


scca=> select count(*) from (select DocThumbnail_text_rank('guardian
ad litem attorney', 0.1)) x;
 count
---
 4
(1 row)

Time: 16.434 ms


scca=> select count(*) from (select
DocThumbnail_text_rank('attorney', 0.1)) x;
 count
---
  1356
(1 row)

Time: 415.056 ms


scca=> select count(*) from (select DocThumbnail_text_rank('guardian
ad litem party', 0.1)) x;
 count
---
 2
(1 row)

Time: 16.290 ms


scca=> select count(*) from (select DocThumbnail_text_rank('party',
0.1)) x;
 count
---
   935
(1 row)

Time: 386.941 ms


-Kevin



Re: [PERFORM] maintaining a reference to a fetched row

2009-11-03 Thread Tom Lane
Brian Karlak zen...@metaweb.com writes:
 My question is this: is there some way that I can keep a cursor /  
 pointer / reference / whatever to the row I fetched originally, so  
 that I don't have to search for it again when I'm ready to write  
 results?

If you don't expect any updates to the row meanwhile, ctid might serve.
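
A minimal sketch of that approach (column names are made up, and the real
queue's locking details are glossed over).  Note that a row's ctid changes
every time the row is updated, so capture it from the UPDATE that marks the
row as taken:

  UPDATE queue
  SET status = 1
  WHERE id = (SELECT id FROM queue WHERE status = 0 ORDER BY id LIMIT 1)
  RETURNING ctid, id;            -- e.g. returns ctid '(1234,5)'

  -- ... do the ~50ms of work ...

  UPDATE queue
  SET status = 2, result = '...'
  WHERE ctid = '(1234,5)';       -- direct heap access by tid, no index search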

regards, tom lane



Re: [PERFORM] Problem with database performance, Debian 4gb ram ?

2009-11-03 Thread Kevin Grittner
Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Nov 3, 2009 at 7:13 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Grant Masan grant.mas...@gmail.com wrote:


 cpu_tuple_cost = 0.0030
 cpu_index_tuple_cost = 0.0010
 cpu_operator_cost = 0.0005

 Why did you make these adjustments?  I usually have to change the
 ratio between page and cpu costs toward the other direction.
 
 Is that because the database is mostly cached in memory?  If I take
 the documented descriptions of the costs parameters at face value, I
 find that cpu_tuple_cost should be even lower yet.
 
Right, the optimizer doesn't model caching effects very well, so I
find that in practice I have to fudge these from their putative
meanings to allow for typical caching.  Even with only a small
fraction of the database cached, the heavily accessed indexes tend to
be fairly well cached, so overall performance improves markedly by
dropping random_page_cost to about 2, even in our lowest-percentage-
cached databases.
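
The sort of adjustment described above, in postgresql.conf terms (illustrative
values only, not a recommendation for any particular system):

  seq_page_cost = 1.0          # default
  random_page_cost = 2.0       # down from the default 4.0, to allow for caching
  # cpu_tuple_cost, cpu_index_tuple_cost and cpu_operator_cost left at defaults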
 
I've occasionally tried using the defaults for that GUC, which has
always resulted in user complaints about unacceptable performance of
important queries.  While I tend to reduce the random_page_cost and
seq_page_cost to tweak things, raising the cpu_*_cost settings would
accomplish the same thing, so reducing them as shown above would tend
to push things into sequential scans where indexed access might be
faster.
 
-Kevin
