Re: [GENERAL] Background worker plus language handler for Andl: OK?

2016-04-25 Thread David Wilson
On Mon, Apr 25, 2016 at 03:18:30PM +1000, da...@andl.org wrote:

> It seems to me that it should be possible to create a language front
> end for Andl by creating a background worker and a language handler.

I've been reading your posts over the past few days, and while I find them
fun to follow, I can't help but wonder why there is such urgency to
reimplement a protocol within PG itself.

It seems to me this is a much larger undertaking than you realize. For
example, you would at least need to reinvent PG's existing authentication
and authorization mechanisms, or perhaps patch PG to expose them usefully
to your own code.

Is there a hard requirement that this stuff be in-process? Most of the
cost of a SQL query is in planning and execution; the time spent copying
some strings around and context switching will be pretty minimal for a
query of any significance.

If I were you I'd start by building a robust proxy server first,
serving up your custom protocol and rewriting it to a PG client
connection internally, and only then look at how that might be merged
in-process if indeed there was a real need for it.


David




Re: [GENERAL] Uninstalled working db by mistake

2016-03-24 Thread David Wilson
On Thu, Mar 24, 2016 at 05:44:27PM +, Howard News wrote:
> Thanks David,
> 
> Unfortunately my cluster won't start - I am not entirely sure of the state of
> postgresql-9.0; this is the output from dpkg --list
> 
> 
> 
> rc  postgresql-9.0 9.0.4-1~lucid1 object-relational SQL database, version
> 9.0

Per the heading printed by dpkg --list, this means the package is in the
removed state but its config files are still present. "apt-get install
postgresql-9.0" should be all that's required.


David


> ii  postgresql-9.1 9.1.20-1.pgdg1 object-relational SQL database, version
> 9.1
> ii  postgresql-9.2 9.2.15-1.pgdg1 object-relational SQL database, version
> 9.2
> 
> Does this mean it is removed?
> 
> On 24/03/2016 17:34, David Wilson wrote:
> >Hi Howard,
> >
> >So long as you haven't touched anything else, simply reinstalling the
> >package should restore your cluster. Debian packages only do
> >initialization if the data directories are missing.
> >
> >
> >David
> >
> >On Thu, Mar 24, 2016 at 05:29:23PM +, Howard News wrote:
> >>Hi,
> >>
> >>I uninstalled the wrong version of postgres on Ubuntu using apt-get remove
> >>postgresql-9.0, convinced that this was an old unused version. You guess the
> >>rest...
> >>
> >>The data files still appear to be there, all 485GB of them. Can these be
> >>restored?
> >>
> >>Thanks.
> >
> 
> 
> 




Re: [GENERAL] Uninstalled working db by mistake

2016-03-24 Thread David Wilson
Hi Howard,

So long as you haven't touched anything else, simply reinstalling the
package should restore your cluster. Debian packages only do
initialization if the data directories are missing.


David

On Thu, Mar 24, 2016 at 05:29:23PM +, Howard News wrote:
> Hi,
> 
> I uninstalled the wrong version of postgres on Ubuntu using apt-get remove
> postgresql-9.0, convinced that this was an old unused version. You guess the
> rest...
> 
> The data files still appear to be there, all 485GB of them. Can these be
> restored?
> 
> Thanks.




Re: [GENERAL] Shared system resources

2015-12-23 Thread David Wilson
On Wed, Dec 23, 2015 at 07:07:31AM -0600, oleg yusim wrote:

> May we run into situation, when attacker dumps memory and analyses it
> for valuable content, instead of reserving it for own process, where
> it would be zeroed? My understanding, it is a possibility. Does kernel
> have any safeguard against it?

Sure, it might be possible, but the attacker would not have much useful
information about which old processes the pages belonged to; besides, they
could most likely simply dump the memory of a connected client in this
case, or indeed just examine the filesystem or page cache to get at the
raw PG database files.

Once someone has this level of access to the system it's not really
useful to model threats much further.

One minor correction from my first mail: MAP_UNINITIALIZED is indeed
accessible to non-root, but as George mentions only when a non-default
kernel parameter has been enabled.


David




Re: [GENERAL] Shared system resources

2015-12-22 Thread David Wilson
On Tue, Dec 22, 2015 at 04:52:23PM -0600, oleg yusim wrote:
> Greetings,
> 
> I'm looking at the following security control right now:
> 
> The DBMS must prevent unauthorized and unintended information transfer via
> shared system resources.
> 
> The purpose of this control is to prevent information, including encrypted
> representations of information, produced by the actions of a prior user/role
> (or the actions of a process acting on behalf of a prior user/role) from being
> available to any current user/role (or current process) that obtains access to
> a shared system resource (e.g., registers, main memory, secondary storage)
> after the resource has been released back to the information system. Control of
> information in shared resources is also referred to as object reuse.
> 
> From previous discussions, I understood that session in PostgreSQL is closely
> associated with process, and it is essentially new process for every user
> connection. In regards to that, my question would be:
> 
> When a user disconnects, the process is terminated and all resources are
> released; does the memory and session/process-stored information get
> "sanitized" or just released as is?

On Linux the memory pages of an exiting process aren't sanitized at
exit; however, it is impossible(?) for userspace to reallocate them
without the kernel first zeroing their contents.

It might be possible for root to use some debugging mechanisms to access
the freed memory, but certainly there is no mechanism for a non-root
user to do so.

Regarding PG in particular, I can't speak for any shared internal state
that might be maintained after a session has exited (e.g. in the SHM
regions PG maintains). Since that state lives longer than an individual
process, it's possible some information leakage could occur that way, but
as far as "object reuse" is concerned, it seems doubtful.


David




Re: [GENERAL] Should PQconsumeInput/PQisBusy be expensive to use?

2010-10-27 Thread David Wilson
On Wed, Oct 27, 2010 at 5:02 PM, Michael Clark codingni...@gmail.com wrote:


 while ( ((consume_result = PQconsumeInput(self.db)) == 1) &&
 ((is_busy_result = PQisBusy(self.db)) == 1) )
 ;


 The problem with this code is that it's effectively useless as a test.
You're just spinning in a loop; if you don't have anything else to be doing
while waiting for responses, then this sort of calling pattern is always
going to be worse than just blocking.

Only do async if you actually have an async problem, and only do a
performance test on it if you're actually doing a real async test;
otherwise the results are fairly useless.

-- 
- David T. Wilson
david.t.wil...@gmail.com


[GENERAL] Avoiding underflow in double -> real conversions

2010-09-24 Thread David Wilson
List-

I've got a calculation that, due to its use of exp(), produces double
precision intermediate results which I ultimately need to cast back to real
to store in a table column. However, with certain inputs, this results in an
underflow error. Is there any way, outside of an exception-trapping plpgsql
function, to perform the conversion and simply cause underflows to result
in a 0?
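
For concreteness, a sketch of the situation and one possible CASE-based
clamp (the tables here are hypothetical and the 1e-37 cutoff is an
assumption, chosen conservatively above the smallest normal real):

CREATE TABLE intermediate (x double precision);
CREATE TABLE target (value real);

INSERT INTO intermediate VALUES (exp(-200.0));  -- far below real's range

-- Store 0 for anything real can't represent instead of erroring on the cast.
INSERT INTO target (value)
SELECT CASE WHEN abs(x) < 1e-37 THEN 0::real ELSE x::real END
FROM intermediate;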

Thanks for any ideas.

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] [pgsql-sql] Daily digest v1.3328 (5 messages)

2010-09-22 Thread David Wilson
On Wed, Sep 22, 2010 at 8:30 PM, Adrian Klaver adrian.kla...@gmail.com wrote:


 From here;

 http://www.postgresql.org/docs/8.4/interactive/datatype-boolean.html



I believe the question relates to changing the string *output* of the
database to something other than 't' and 'f', not an issue with input
formats.

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Post Install / Secure PostgreSQL

2010-09-15 Thread David Wilson
On Wed, Sep 15, 2010 at 1:36 PM, Carlos Mennens carlos.menn...@gmail.com wrote:

 Thanks but then I am confused why I am getting the following:


You initially connected to the postgres database because you were logging in
as the postgres user...



 postgres=# \d
 No relations found.
 postgres=# \dt
 No relations found.


... and then you *reconnected* to the same database by requesting to do so.


 postgres=# \c postgres
 psql (8.4.4)
 You are now connected to database postgres.
 postgres=# \dt
 No relations found.
 postgres=# \d



-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Partial indexes instead of partitions

2010-06-14 Thread David Wilson
On Mon, Jun 14, 2010 at 5:24 AM, Leonardo F m_li...@yahoo.it wrote:

  For inserts I do not see the reason
  why
  it would be better to use index partitioning because AFAIK
  b-tree
  would behave exactly the same in both cases.

 no, when the index gets very big inserting random values gets
 very slow.


Do you have any empirical evidence for this being a real problem, or are you
simply guessing? I have tables with 500m+ rows, on commodity hardware (4
SATA disks in raid 10), and inserts to the indexes on those tables remain
quite acceptable from a performance standpoint.

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Partial indexes instead of partitions

2010-06-14 Thread David Wilson
On Mon, Jun 14, 2010 at 8:38 AM, Peter Hunsberger 
peter.hunsber...@gmail.com wrote:



 Can you define acceptable?  IIRC the OP is looking for 20,000+ inserts /
 sec.


 He's actually only looking for 2k inserts/sec. With a battery backed
controller I can sustain that, yes. That's also on commodity hardware (the
whole system was under $2k a year and a half ago).


-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] 110,000,000 rows

2010-05-26 Thread David Wilson
On Wed, May 26, 2010 at 4:29 PM, John Gage jsmg...@numericable.fr wrote:

 Please forgive this intrusion, and please ignore it, but how many
 applications out there have 110,000,000 row tables?  I recently multiplied
  85,000 by 1,400 and said no way, Jose.


I've got a DB with two 500,000,000+ row tables. Performance is quite good
even on very limited hardware if you set up the tables correctly (cluster,
indices, etc) and have a reasonable usage pattern.

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Query not using the indexes properly.

2009-10-01 Thread David Wilson
On Thu, Oct 1, 2009 at 10:04 PM, Chris dmag...@gmail.com wrote:

 Tim Uckun wrote:

 If this is a common query you could probably do a multi-column index on all
 3 columns (id, company_id, source_model_name) - but if company_id and
 source_model_name have a low number of distinct values, then it's not going
 to help.


If you try the multi-column index (which is a good idea), be sure that id
is the last of the three columns, since that's the column on which you have
an inequality test rather than an equality test; eg,
(company_id,source_model_name,id).
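
For illustration, a minimal sketch (the table name and the exact predicate
here are assumptions, not from the original query):

CREATE INDEX company_model_id_idx
    ON some_table (company_id, source_model_name, id);

-- The two equality columns lead, so a query like this can use the whole index:
SELECT * FROM some_table
WHERE company_id = 42
  AND source_model_name = 'Widget'
  AND id > 1000;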

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Query not using the indexes properly.

2009-10-01 Thread David Wilson
On Thu, Oct 1, 2009 at 10:21 PM, Tim Uckun timuc...@gmail.com wrote:


 Interesting. I would have thought the order of the fields would not
 matter. I don't have to rewrite the query do I?


No. For multi-column indices, however, postgres can, starting at the
leftmost in the index, use as many columns as match equality comparisons
plus one column using an inequality comparison.

From our fine manual, section 11.3:

A multicolumn B-tree index can be used with query conditions that involve
any subset of the index's columns, but the index is most efficient when
there are constraints on the leading (leftmost) columns. The exact rule is
that equality constraints on leading columns, plus any inequality
constraints on the first column that does not have an equality constraint,
will be used to limit the portion of the index that is scanned. Constraints
on columns to the right of these columns are checked in the index, so they
save visits to the table proper, but they do not reduce the portion of the
index that has to be scanned. For example, given an index on (a, b, c) and a
query condition WHERE a = 5 AND b = 42 AND c < 77, the index would have to
be scanned from the first entry with a = 5 and b = 42 up through the last
entry with a = 5. Index entries with c >= 77 would be skipped, but they'd
still have to be scanned through. This index could in principle be used for
queries that have constraints on b and/or c with no constraint on a — but
the entire index would have to be scanned, so in most cases the planner
would prefer a sequential table scan over using the index.

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [GENERAL] Need help with libpq and PQexec

2009-08-11 Thread David Wilson
On Tue, Aug 11, 2009 at 3:35 AM, Juan Backson juanback...@gmail.com wrote:
 PGRES_COMMAND_OK

You want PGRES_TUPLES_OK for a select statement. You're not getting an
actual failure- you're checking for the wrong status.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] general question on two-partition table

2009-07-27 Thread David Wilson
On Mon, Jul 27, 2009 at 7:52 PM, Janet Jacobsen jsjacob...@lbl.gov wrote:

 Can you suggest other strategies?

Something that might be easier to play with is to create a (or
several, to speed up other queries) functional index on the comparison
between rbscore and the cutoff. It won't buy you anything on seq
scans, but if most of the queries are done on small subsets of the
tuples which meet the good criteria, it could be a big win that's
very easy to implement.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] general question on two-partition table

2009-07-27 Thread David Wilson
On Mon, Jul 27, 2009 at 8:24 PM, Greg Stark gsst...@mit.edu wrote:

 I think it would be even more interesting to have partial indexes --
 ie specified with WHERE rbscore > cutoff.

Yes- that's what I actually meant. Words got scrambled between brain
and fingers...
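
For the record, a minimal sketch of what that looks like (the table name,
indexed column and cutoff value are assumptions, not from the thread):

CREATE INDEX good_rbscore_idx ON candidates (id) WHERE rbscore > 0.17;

Queries whose WHERE clause implies rbscore > 0.17 can then use this much
smaller index.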

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] problem with pg_restore?

2009-07-14 Thread David Wilson
On Tue, Jul 14, 2009 at 7:47 PM, Jim Michaels jmich...@yahoo.com wrote:
 could somebody rewrite pg_dumpall and pg_dump so that it makes editable
 dumps?
 most programmer's text editors can't handle more than 2000 characters per
 line.
 and I want to be able to edit my dumps.

You're going to have better luck finding a decent editor than finding
someone to rewrite pg_dump and pg_dumpall just for you.


-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] question about frequency of updates/reads

2009-06-23 Thread David Wilson
On Wed, Jun 24, 2009 at 12:48 AM, Scott Marlowe scott.marl...@gmail.com wrote:

 Maybe.  Rows that are updated often are NOT generally pgsql's strong
 suit, but IF you're running 8.3 or above, and IF you have a low enough
 fill factor that there's empty space for the updates and IF the fields
 you are updating are not indexed and IF you have aggressive enough
 vacuuming and IF you restrict your updates to JUST real updates (i.e.
 update ... set a=1 where a<>1) and IF your IO subsystem has enough raw
 horsepower, you can make this work.  But only benchmarking will tell
 you if you can do it with your current hardware and setup.

One option, though not nearly as nice as a simple additional table, is
to initdb on a ramdisk and run a separate server instance on it. It
doesn't sound like a) you've got a lot of data in there, or b) you
really care about the data sticking around after a crash (since it's
in the proprietary system already). You'd get the benefit of easy SQL
for the storage and retrieval, though you'd need a separate connection
to the separate server instance on the web page, which would be
annoying, and you'd probably want to script the ramdisk and db setup
since you'll have to recreate it after a crash. (Alternately, create
the cluster on disk somewhere, then put a copy on the ramdisk and start
a postgres instance pointed at it. After a restart you just need to
copy over from disk and start up the postgres instance.)


-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Select ranges based on sequential breaks

2009-06-15 Thread David Wilson
On Mon, Jun 15, 2009 at 2:23 PM, Mike Toews mwto...@sfu.ca wrote:
 Hi,

 I'm having difficulty constructing a query that will find breaks where data
 change in a time-series. I've done some searching for this too, but I
 haven't found anything.

 Here is my example situation, consider my source table:
 date    bin
 2009-01-01      red
 2009-01-02      red
 2009-01-03      blue
 2009-01-04      blue
 2009-01-05      blue
 2009-01-06      red
 2009-01-07      blue
 2009-01-08      blue
 2009-01-09      red
 2009-01-10      red


 I would like to get the first and last of each consecutive series based on
 column bin. My result for the table would look like:
 first   last    bin
 2009-01-01      2009-01-02      red
 2009-01-03      2009-01-05      blue
 2009-01-06      2009-01-06      red
 2009-01-07      2009-01-08      blue
 2009-01-09      2009-01-10      red


 This is easy to compute using a spreadsheet or in R, but how would I do this
 with SQL? I'm using 8.3. Advice is appreciated.

(Written in email and untested- also, someone will probably provide a
better way, I hope, but this should at least work)

-- "series" stands in for your source table (columns: date, bin)
select date as first,
       coalesce((select date from series t3
                  where t3.date < (select date from series t5
                                    where t5.date > t1.date and t5.bin <> t1.bin
                                    order by date asc limit 1)
                  order by date desc limit 1),
                (select max(date) from series)) as last,
       bin
from series t1
where (select bin from series t2 where t2.date < t1.date
       order by date desc limit 1) is distinct from t1.bin;

Ugly, and I'm pretty sure there's a much better way, but my brain is
failing me right now- hopefully this'll at least get you started,
though.
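
(As an aside: on 8.4 or later this is a textbook gaps-and-islands problem
that window functions handle much more cleanly. A sketch, with the table
name assumed since the original only shows the columns:

select min(date) as first, max(date) as last, bin
from (select date, bin,
             row_number() over (order by date)
           - row_number() over (partition by bin order by date) as grp
      from series) s
group by bin, grp
order by first;

The difference of the two row_number() values is constant within each
consecutive run of the same bin, so grouping on it splits out the runs.)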

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Entering time into postgresql

2009-06-08 Thread David Wilson
On Mon, Jun 8, 2009 at 11:04 AM, BlackMage dsd7...@uncw.edu wrote:

 Is there a field that can actually store time in PostgreSql. And what I mean
 by time is not like time of the day, but a specific amount of time.

The manual is your friend:
http://www.postgresql.org/docs/8.3/static/datatype-datetime.html

The interval data type is what you're looking for.
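
For example (a minimal sketch, not from the original question):

CREATE TABLE task (
    name     text,
    duration interval
);

INSERT INTO task VALUES ('backup', interval '2 hours 30 minutes');

SELECT name FROM task WHERE duration > interval '1 hour';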

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] How should I deal with disconnects during insert?

2009-05-20 Thread David Wilson
On Wed, May 20, 2009 at 4:41 PM, Sergey Samokhin prikru...@gmail.com wrote:

 Is there a way to do #1 (it's a bit cleaner and simpler to implement)?

You could wrap the whole thing in a transaction. If you don't get to
the commit, the whole transaction should roll back.
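
A minimal sketch (the table here is hypothetical): if the connection drops
before the COMMIT, neither row persists.

CREATE TABLE readings (sensor_id int, value float8);

BEGIN;
INSERT INTO readings (sensor_id, value) VALUES (1, 42.0);
INSERT INTO readings (sensor_id, value) VALUES (2, 17.5);
COMMIT;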

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Providing an alternative result when there is no result

2009-05-18 Thread David Wilson
On Mon, May 18, 2009 at 3:13 PM, Joshua Berry yob...@gmail.com wrote:
 Any hints/tips? Is our original solution okay, or is there something we can
 do to improve things?

It seems as if you don't really care about the results of the query-
just whether or not it returns any rows. In that case, why not
something like:

select (case when exists (select * from foo where...) then true else
false end) as result;
-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Doubt about join clause

2009-04-20 Thread David Wilson
On Mon, Apr 20, 2009 at 7:39 PM, jc_mich juan.mich...@paasel.com wrote:

 You've understood very well my problem, but also this query works as worse
 than everything I did before, it throws as many rows as rows are contained
 my tables clients and stores. I only want to find for every client what
 store is closer to him, I expect one client to one store and their distance

select clients.id as client_id, (select stores.id from stores order by
(power(clients.x - stores.x, 2) + power(clients.y - stores.y, 2)) asc limit 1)
as store_id from clients;

Should do the trick, or at least something very similar.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Now I am back, next thing. Final PGS tuning.

2009-04-08 Thread David Wilson
On Wed, Apr 8, 2009 at 12:24 PM, Jennifer Trey jennifer.t...@gmail.com wrote:

 I think I might have misunderstood the effective cache size. Its measured in
 8kB blocks. So the old number 449697 equals 3.5 GB, which is quite much.
 Should I lower this? I had plans to use 2.75GB max. Can I put 2.75GB there?
 Should I leave it?

The effective cache size setting is merely letting postgres know how
much caching it can expect the OS to be doing. If you know that the OS
isn't going to have more than 2.75 GB available for caching DB files,
then by all means reduce it. The setting by itself doesn't affect
postgres memory usage at all, though.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] SELinux problem rsync'ing WAL logs

2009-03-31 Thread David Wilson
On Tue, Mar 31, 2009 at 9:18 PM, Aleksey Tsalolikhin
atsaloli.t...@gmail.com wrote:

 Could not create directory '/home/postgres/.ssh'.
 Host key verification failed.

Have you tested ssh node2 as the postgres user with SELinux enabled?
This looks like ssh failing to access the .ssh directory where it
keeps host keys (the known_hosts file) and dying as a result. None of
the SELinux module setup lines seem to cover that, so you may want to
see if there's an SELinux failure for ssh in the audit log that could
give you a clue as to what needs to be allowed.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] text column constraint, newbie question

2009-03-23 Thread David Wilson
On Mon, Mar 23, 2009 at 3:07 AM, Scott Marlowe scott.marl...@gmail.com wrote:

 Are you saying pg_query_params is MORE effective than pg_escape_string
 at deflecting SQL injection attacks?

pg_query_params() will protect non-strings. For instance, read a
number in from user input and do something of the form "... and
foo=$my_number". Even if you escape the string, an attacker doesn't
need a ' to close a string, so he can manage injection. If it's "... and
foo=$1" using pg_query_params(), however, that's not possible.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] connecting using libpq breaks printf

2009-02-18 Thread David Wilson
On Wed, Feb 18, 2009 at 9:47 AM, Joey Morris rjmorri...@gmail.com wrote:

 I expected this program to print connecting, but in fact I get no output
 whatsoever. If I comment out the PQconnectdb and PQfinish lines, I see
 connecting as expected. What could be going on here?

Try adding fflush(stderr); after your fprintf() call. If the
connection is hanging and output hasn't been flushed, you wouldn't see
much.


-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Query sometimes takes down server

2009-01-16 Thread David Wilson
On Fri, Jan 16, 2009 at 3:27 PM, Jason Long
mailing.l...@supernovasoftware.com wrote:

 I just tried it by sending text only instead of text and html.  We will see
 if it goes through this time.
 Other than that do you see anything weird about my email?

Still nothing. Do you have webspace you could place it on? If not, you
could use http://explain-analyze.info and simply provide the list with
links to the plans.

-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Timestamp shift when importing data

2009-01-03 Thread David Wilson
On Sat, Jan 3, 2009 at 5:59 PM, Jolles, Peter M (GE Infra, Energy)
peter.jol...@ge.com wrote:

 For example, on these days, 4/7/02  2:00 AM imports to 4/7/02 3:00 AM.
 4/6/03 2:15 AM imports as 4/6/03 3:15 AM, etc. All other dates and times
 do not give any errors. I have tried to extract the date and create a
 text field in MS Access, I get the same error when imported to Postgres.
 Is there some significance to these dates?

Those are the dates of daylight savings time kicking in- which
happens, not coincidentally, at 2am.
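
For illustration, a minimal sketch of how a timestamptz interpretation
shifts a nonexistent local time (the timezone setting here is an assumption):

SET timezone = 'America/New_York';
SELECT '2002-04-07 02:15:00'::timestamptz;
-- 2002-04-07 03:15:00-04: 2:00-3:00 AM doesn't exist on that date locally,
-- so the value is pushed forward an hour.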

What's the type of the field you're trying to import into, and how are
you doing the import?

(Sorry for the duplicate- missed sending to the list the first time.)
-- 
- David T. Wilson
david.t.wil...@gmail.com



Re: [GENERAL] Using Postgres to store high volume streams of sensor readings

2008-11-22 Thread David Wilson
On Sat, Nov 22, 2008 at 4:54 PM, Ciprian Dorin Craciun
[EMAIL PROTECTED] wrote:
 On Sat, Nov 22, 2008 at 11:51 PM, Scott Marlowe [EMAIL PROTECTED] wrote:
 On Sat, Nov 22, 2008 at 2:37 PM, Ciprian Dorin Craciun
 [EMAIL PROTECTED] wrote:

Hello all!
 SNIP
So I would conclude that relational stores will not make it for
 this use case...

 I was wondering you guys are having to do all individual inserts or if
 you can batch some number together into a transaction.  Being able to
 put > 1 into a single transaction is a huge win for pgsql.

I'm aware of the performance issues between 1 insert vs x batched
 inserts in one operation / transaction. That is why in the case of
 Postgres I am using COPY table FROM STDIN, and using 5k batches...
 (I've tried even 10k, 15k, 25k, 50k, 500k, 1m inserts / batch and no
 improvement...)

I've had exactly the same experience with Postgres during an attempt
to use it as a store for large-scale incoming streams of data at a
rate very comparable to what you're looking at (~100k/sec). We
eventually just ended up rolling our own solution.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Table bloat in 8.3

2008-11-13 Thread David Wilson
On Thu, Nov 13, 2008 at 2:03 PM,  [EMAIL PROTECTED] wrote:
 I have several tables that when I run VACUUM FULL on, they are under 200k,
 but after a day of records getting added they grow to 10 to 20 megabytes.
 They get new inserts and a small number of deletes and updates.

 seq_scan | 32325
 seq_tup_read | 39428832
 idx_scan | 6590219
 idx_tup_fetch| 7299318
 n_tup_ins| 2879
 n_tup_upd| 6829984
 n_tup_del| 39
 n_tup_hot_upd| 420634
 n_live_tup   | 2815
 n_dead_tup   | 0

Can you define "small number of deletes and updates"? The stats above
would disagree with "small". Remember that every update creates a new,
updated version of the row, which is where the increase is coming
from.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Semi-customized queries? Subset of SQL? Accessing the parser? Injection?

2008-11-04 Thread David Wilson
On Tue, Nov 4, 2008 at 2:12 PM, Webb Sprague [EMAIL PROTECTED] wrote:

 Can't do that.  (Or I wouldn't have asked the question.)  Need a WWW
 interface, period.

A WWW interface doesn't preclude the suggestion of simply relying on
permissions to maintain safety and providing what amounts to a query
command line; I've got that exact thing in php for one of my DBs. The
user can't make db changes, and just from paranoia I check the query
for certain bad keywords (delete, insert, into, update, drop, create,
alter, etc) before passing it on. On return, some simple php functions
create a table with the appropriate column names and such.

It's surely the simplest solution, and it definitely will work.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] postgresql and Mac OS X

2008-10-28 Thread David Wilson
On Tue, Oct 28, 2008 at 7:10 PM, Christophe [EMAIL PROTECTED] wrote:

 On Oct 28, 2008, at 4:03 PM, Steve Atkins wrote:

 Installing from source means I can avoid the fragility of macports or
 fink, and know that I've built it in much the same way as the postgresql or
 solaris installation I'd be using for production.

 +1

Make that +2.

My primary development machine is my macbook pro, and I'd definitely
suggest compiling from source. It's quite painless, and it ensures
that you have exactly what you think you have on the machine. Getting
it to start up automatically is pretty trivial, and I'd say that the
benefits *far* outweigh the avoidance of a few annoying system
administration tasks at install time. Saving myself pain over time is
definitely far better than saving a bit of discomfort once, IMO!

-- 
- David T. Wilson
[EMAIL PROTECTED]



[GENERAL] Out of memory in create index

2008-10-13 Thread David Wilson
After dropping an index to do some full-table updating, I'm running
into an out of memory issue recreating one of my indices. This is on
8.3 running on linux.

The table in question has about 300m rows. The index is on a single
integer column. There are approximately 4000 unique values among the
rows.

create index val_datestamp_idx on vals(datestamp) tablespace space2;

About 30 seconds into the query, I get:
ERROR:  out of memory
DETAIL:  Failed on request of size 536870912.

Increasing maintenance_work_mem from 1GB to 2GB changed nothing at
all- exact same error at exact same time. Watching memory on the
machine shows the out of memory error happens when the machine is only
at about 35% user. create index concurrently shows an identical error.

Two other indexes (multicolumn) on the same table have already been
successfully recreated, so this puzzles me.

Actually, while I was writing this, I added an additional column to
the index and it now appears to be completing (memory has reached
about the point it had been failing at and is now holding steady, and
the query has been going for significantly longer than the 30 seconds
or so it took to error out previously). I sort by both columns at
times, so the extra column may in fact turn out to be useful, but the
failure of the single column create index in the face of the other
successful creates has me confused. Can anyone shed some light on the
situation?
-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Out of memory in create index

2008-10-13 Thread David Wilson
On Mon, Oct 13, 2008 at 6:44 AM, Gregory Stark [EMAIL PROTECTED] wrote:

 How much memory the OS allows Postgres to allocate will depend on a lot of
 external factors. At a guess you had some other services or queries running at
 the same time the first time which reduced the available memory.

I'm sorry- I was insufficiently clear. Postgres was the only service
running, and there were no additional queries happening at the same
time. (This database is on a dedicated machine; the only other things
that run are some decision-support applications that were all off at
the time.) In addition, the 35% memory usage number was for user-space
processes in total, not for postgres specifically; the swap space was
completely clear. maintenance_work_mem + work_mem is well under the
total amount of RAM on the system, and certainly well under RAM +
swap.

I'll try building that index with a lower maintenance_work_mem this
evening, when I can shut off the other processes again, though given
the above it strikes me as unlikely to be the problem.
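
For reference, the retry will look roughly like this; the 256MB figure is
just an assumption for the test, not a recommendation:

SET maintenance_work_mem = '256MB';
CREATE INDEX val_datestamp_idx ON vals (datestamp) TABLESPACE space2;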

Also, the thing that has me even more confused is the fact that it
worked when I added an additional column to the index.


-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Question About UNION

2008-10-09 Thread David Wilson
On Thu, Oct 9, 2008 at 1:48 PM, Bill Thoen [EMAIL PROTECTED] wrote:
 No, this won't work. Here's an example of my tables:
 Table1
 1, 12, A
 2, 16, B
 8, 6, A
 19, 9, C

 Table2
 1, 13, D
 2, 16, B
 8, 6, B
 12, 5, A

select * from table1
union
select table2.* from table2 left join table1 on table2.a=table1.a and
table2.b=table1.b where table2.a is null;

(Written in gmail, but you should get the basic idea.)

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Question About UNION

2008-10-09 Thread David Wilson
On Thu, Oct 9, 2008 at 3:31 PM, Bill Thoen [EMAIL PROTECTED] wrote:

 Thanks, but that didn't work. That selected only the records from table1.

That's why I warned you about it being written in gmail. :)

select * from table1
union
select table2.* from table2 left join table1 on table2.a=table1.a and
table2.b=table1.b where table1.a is null;

This should probably do it without the temp table (the first version
was checking for null in the wrong place).
-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Efficiency of inner joins across multiple tables

2008-10-07 Thread David Wilson
On Tue, Oct 7, 2008 at 7:30 PM, Saptarshi Guha [EMAIL PROTECTED] wrote:

 Q: How efficient is this?

If connectionHash is unique or near to it, and everything's indexed...
it could work for reasonable K values depending on your performance
requirements.

 Every time i create a new table should i create a table with all columns and
 query from that?

Is there any reason you can't just add the new column to the existing table?

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Psql command for rowcount

2008-09-11 Thread David Wilson
On Thu, Sep 11, 2008 at 3:19 PM, Markova, Nina [EMAIL PROTECTED] wrote:
 Is there a psql or other command that I can use to list tables and their
 rows? All I found is this:
 http://archives.postgresql.org/pgsql-hackers/2004-09/msg00876.php

select tablename, reltuples from pg_class inner join pg_tables on
tablename=relname where tablename !~* '^pg_' and tablename !~*
'^sql_';

Remember that the reltuples count is an *estimate* and won't be 100%
accurate- only a count(*) will get you that.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Prepared statements aren't working with parameters with PQexecParams

2008-09-05 Thread David Wilson
On Fri, Sep 5, 2008 at 2:52 AM, Subspace god [EMAIL PROTECTED] wrote:

 The following does not work in C++, executed as above, in another session
 using two subsequent calls to PQexecParams

 PREPARE myquery (text) AS INSERT INTO myTable (word) VALUES ($1);
 EXECUTE myquery($1::text);

You're doing prepared statements incorrectly. Use PQprepare() and
PQexecPrepared() when using them from libpq; don't do them by hand on
your own.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] What's size of your PostgreSQL Database?

2008-08-19 Thread David Wilson
On Fri, Aug 15, 2008 at 11:42 PM, Amber [EMAIL PROTECTED] wrote:
 Dear all:
We are currently considering using PostgreSQL to host a read only 
 warehouse, we would like to get some experiences, best practices and 
 performance metrics from the user community, following is the question list:

DB is ~650m rows across 10 tables and is currently around 160gb.

Running on Ubuntu, mostly because this db started out as a toy and it
was easy. It's done well enough thus far that it isn't worth the
hassle to replace it with anything else.

Currently only using Raid 0; the database can be regenerated from
scratch if necessary so we don't have to worry overmuch about disk
failures.

Machine is a quad-core Xeon 2.5 with 4g of RAM.

Our access pattern is a little odd; about half the database is wiped
and regenerated at approximately 1-2 month intervals (the regeneration
takes about 2 weeks); in between there's a nightly computation run
that creates a small amount of new data in two of the tables. Both the
regeneration and the addition of the new data depend very heavily on
many, many multi-table joins that generally involve about 50% of the
database at a time. We've been fairly pleased with the performance
overall, though it's taken some tweaking to get individual operations
to perform adequately.

I can't speak to pure load operations; all of our bulk-load style ops
are 4k-row COPY commands interspersed among a lot of big, complicated
aggregate queries- not exactly ideal from a cache perspective.

Concurrent readers are anywhere from 1-8, and we're not in a cluster.
Sequential transfer rate is usually a touch over 100mb/sec; we don't
have a lot of disks on this machine (though that may change; oh, how
some of our index scans long for more spindles).

The performance improvements made in the past few releases have been
incredibly helpful- and very much noticeable each time.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] is a 'pairwise' possible / feasible in SQL?

2008-08-04 Thread David Wilson
On Mon, Aug 4, 2008 at 2:08 PM, Rajarshi Guha [EMAIL PROTECTED] wrote:

 select count(aid) where cid = 123 and cid = 456;

 but I was wondering whether I could construct a single SQL statement to do
 this.

 Any pointers would be appreciated,

Typed into gmail, so may need some tweaking, but something to the effect of:

select a.cid, b.cid, count(*) from table a inner join table b on
a.aid=b.aid group by a.cid, b.cid;

should do the trick, I'd think...

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] use of column in COPY

2008-08-01 Thread David Wilson
On Fri, Aug 1, 2008 at 10:16 AM, Ivan Sergio Borgonovo
[EMAIL PROTECTED] wrote:
 what's the use of the column list in the COPY FROM syntax if I get:

 ERROR:  extra data after last expected column

 I've read:
 http://bytes.com/forum/thread424089.html

 but then... is there any actual use?


I use COPY FROM extensively; the column specification allows for cases
where the incoming file doesn't match the table's column ordering (or
if the incoming file has *fewer* columns than the table).
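
For example, a minimal sketch (the table and the file layout are assumptions):

CREATE TABLE items (
    id       serial PRIMARY KEY,
    name     text,
    price    numeric,
    added_at timestamptz DEFAULT now()
);

-- The file supplies only name and price, in that order; id and added_at
-- are filled from their defaults.
COPY items (name, price) FROM '/tmp/items.csv' WITH CSV;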

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] non-WAL btree?

2008-08-01 Thread David Wilson
On Fri, Aug 1, 2008 at 4:07 PM, Alex Vinogradovs
[EMAIL PROTECTED] wrote:
 By loading in bulks, I mean I load some 40-50 thousand
 rows at once into a table that already has some millions.
 Index rebuild on that table after each 50k inserts will
 be even less efficient ;)

How many indexes do you have on this...? I do this pretty regularly
(actually, I do 4k batches with COPY, 4-10 concurrent batches every 10
seconds, for 2-3 days at a time) and, having tested dropping indices,
nothing to do with the index has a particularly strong performance
impact.

That said, a significant increase in checkpoint segments was required
to get good performance out of the above use case. If you haven't
tried that, I'd say that's a good place to start.

What makes you think it's specifically index WAL work, though?

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] non-WAL btree?

2008-08-01 Thread David Wilson
On Fri, Aug 1, 2008 at 4:16 PM, David Wilson [EMAIL PROTECTED] wrote:
 On Fri, Aug 1, 2008 at 4:07 PM, Alex Vinogradovs
 [EMAIL PROTECTED] wrote:
 By loading in bulks, I mean I load some 40-50 thousand
 rows at once into a table that already has some millions.
 Index rebuild on that table after each 50k inserts will
 be even less efficient ;)

 How many indexes do you have on this...? I do this pretty regularly
 (actually, I do 4k batches with COPY, 4-10 concurrent batches every 10
  seconds, for 2-3 days at a time) and, having tested dropping indices,
 nothing to do with the index has a particularly strong performance
 impact.

Sorry, as I hit send, I realized I should clarify this: I do my bulk
loads with the indexes active; I don't rebuild them (as they're
necessary during the batch calculations). Dropping the indexes and
using test data didn't show a significant performance improvement over
leaving the indexes enabled.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Cursor

2008-07-29 Thread David Wilson
On Tue, Jul 29, 2008 at 5:42 PM, Bob Pawley [EMAIL PROTECTED] wrote:

 begin

Don't you need a ; after your begin...?

-- 
- David T. Wilson
[EMAIL PROTECTED]



[GENERAL] copy ... from stdin csv; and bytea

2008-07-27 Thread David Wilson
My application is adding a bytea column to a table into which data is
dumped in approximately 4k row batches, one batch approximately every
10 seconds. To this point, those dumps have used copy from stdin;
however, I'm having some difficulty getting bytea encodings to work
with it. Specifically, I can't seem to get the parser to recognize
that what I'm handing it is an escaped string that needs to be parsed
back into individual bytes rather than stored as-is. The encoding is
very straightforward for INSERT, of course, but the COPY ... FROM
STDIN CSV doesn't seem to want to work no matter what I've tried:

\\000
\\000
E'\\000'

etc.

Is there a trick to this that I just didn't see in the documentation,
or is this some limitation of CSV copy-in? If the latter, are there
suggestions for workarounds other than to fall back on the inserts?

Using 8.3.3, and this is specifically via libpq, if that makes a difference.

Thanks much.
-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] copy ... from stdin csv; and bytea

2008-07-27 Thread David Wilson
On Mon, Jul 28, 2008 at 1:24 AM, Klint Gore [EMAIL PROTECTED] wrote:
 Try just a single \

 e.g.
 ge.xls,application/vnd.ms-excel,71168,\320\317\021\340\241[snip]

Thanks- I did try that, and it at least gave the expected output from
select, but is there a way to verify that it's actually handling it
correctly rather than simply storing the sequence of characters? I'm
not certain how to check the actual byte width of a column within a
row, and I'd *really* rather not be storing 4 bytes for every 1 in the
binary if I can avoid it- this column is already going to be doubling
field width; quadrupling it would give me space headaches I really
don't want to deal with. :)
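
For what it's worth, one way to check the stored byte width (the column and
table names below are assumptions) is octet_length(), which reports the
actual number of bytes in a bytea value; an over-escaped value shows up as
roughly four times the expected size:

SELECT octet_length(file_data) AS stored_bytes FROM attachments LIMIT 5;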

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] renumber table

2008-06-19 Thread David Wilson
On Thu, Jun 19, 2008 at 7:54 AM, Steve Clark [EMAIL PROTECTED] wrote:

 I am not sure that will do what I want. As an example
 suppose I have 5 rows and the idfield is 1,2,3,4,5
 now row 1 is updated, not the idfield but another column, then row 3 is
 deleted.
> Now I would like to renumber them such that 1 is 1, 2 is 2, 4 is 3, 5 is
> 4.

 I don't think what you wrote will necessarily keep them in the same relative
 order that they
 were before.

It's ugly and a hack, but you could do the following:

1. rename the table
2. create a new copy of the table. give the idfield a sequence.
3. select all the records from the renamed table, ordering by idfield,
and insert all fields other than idfield into the new table.
4. drop the renamed table

You'll maintain ordering and have a nice sequential idfield.
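
A minimal sketch of those steps (table and column names other than idfield
are assumptions):

BEGIN;
ALTER TABLE items RENAME TO items_old;
CREATE TABLE items (
    idfield serial,
    payload text
);
INSERT INTO items (payload)
SELECT payload FROM items_old ORDER BY idfield;  -- keeps the old order
DROP TABLE items_old;
COMMIT;
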
-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Determining offsets to jump to grab a group of records

2008-06-11 Thread David Wilson
On Wed, Jun 11, 2008 at 5:54 PM, David Lambert [EMAIL PROTECTED] wrote:
 I am trying to find a way to figure out what offset I would have to use in a
 SELECT with LIMIT and OFFSET clauses to get a grouping of records. For
 example:

  Consider a table full of first names.  I want to be able to find the first
 offset where the name is DAVID.  (We'll say that it is the 1009th row in
 the resulting select) This way I could perform the following:

SELECT name FROM nametable LIMIT 25 OFFSET 1009;


This isn't an answer to your direct question, but it seems to me as if
you've already decided on a solution (figuring out an offset) that's
non-optimal. If you're using offsets at all, you must have a distinct
sort ordering; if you have that, you should be able to accomplish the
same thing with a where clause instead of an offset. Is there a
particular reason why this doesn't work for you?
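
For illustration, something like this (a sketch assuming name is the sort
key, as it would have to be for the OFFSET version anyway) replaces the
offset hunt entirely:

SELECT name
FROM nametable
WHERE name >= 'DAVID'
ORDER BY name
LIMIT 25;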

Perhaps the better question is what are you trying to do with this?
There may be a higher level solution that can be given.


-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Alias in the HAVING clause

2008-05-13 Thread David Wilson
On Tue, May 13, 2008 at 5:43 PM, Nathan Thatcher [EMAIL PROTECTED] wrote:
 Is this the correct way to do this, or is there a better way / a way
 to get PostgreSQL to recognize an alias in the HAVING clause?

As Tom pointed out, f1's not in scope for the HAVING clause. If you're
that concerned about expression duplication, you could move the
calculation into a sub-select:

SELECT COUNT(*), f1 FROM (SELECT id % 3 AS f1 FROM table) t1 GROUP BY
f1 HAVING f1 <> 0;

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Backup/Restore of single table in multi TB database

2008-05-07 Thread David Wilson
On Wed, May 7, 2008 at 4:02 PM, John Smith [EMAIL PROTECTED] wrote:

  Does anyone see a problem with this approach (e.g., correctness,
  performance, etc.)?  Or is there perhaps an alternative approach using
  some other postgresql mechanism that I'm not aware of?

Did you already look at and reject pg_dump for some reason? You can
restrict it to specific tables to dump, and it can work concurrently
with a running system. Your database is large, but how large are the
individual tables you're interested in backing up? pg_dump will be
slower than a file copy, but may be sufficient for your purpose and
will have guaranteed correctness.

I'm fairly certain that you have to be very careful about doing simple
file copies while the system is running, as the files may end up out
of sync based on when each individual one is copied. I haven't done it
myself, but I do know that there are a lot of caveats that someone
with more experience doing that type of backup can hopefully point you
to.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Custom Base Type in C

2008-05-07 Thread David Wilson
On Wed, May 7, 2008 at 7:06 PM, Toby Chavez [EMAIL PROTECTED] wrote:

 My custom type needs to have one biginteger and one text value... pretty
 straight forward. I guess my first question is if there are any examples out
 there that do something similar. I have looked extensively through the
 contrib directory and can't find anything very helpful. If not, could
 someone help me understand what my c struct would look like and what values
 I need for INTERNALLENGTH, ALIGNMENT, STORAGE, etc in the CREATE TYPE
 command?

Is there any particular reason why this needs to be done in C? Why not
just create the type normally as per:

CREATE TYPE my_transition_type AS (a bigint, b text);

And then create your aggregate function using that type?

I can't help you with the C stuff, but you may be making far more work
for yourself than you really need.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] GROUP BY, ORDER LIMIT ?

2008-05-06 Thread David Wilson

On Tue, May 6, 2008 at 1:22 PM, Kynn Jones [EMAIL PROTECTED] wrote:

 Suppose table X has two columns: class (TEXT) and size (INT).  I want a
 listing showing the (up to) 5 largest values of size for each value of
 class (for some values of class the total number of available records
 may be less than 5).

 What would be the simplest way to achieve such a listing?  It seems like
 something one could do with GROUP BY, ORDER, and LIMIT, but I can't figure
 out the right syntax for this query.

Warning, this is typed directly into mail:
select class, size from X t1 where size in (select size from X t2
where t2.class=t1.class order by size desc limit 5);

That should do the trick. The only problem is if you've got duplicated
size values, you could end up with more than 5 per class.

-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] How to modify ENUM datatypes?

2008-05-01 Thread David Wilson
  Maybe I'm some crazy, radical DBA, but I've never had a version of
  pgsql get EOLed out from underneath me.  I migrated from 7.4 to 8.1
  right around the time 8.2 came out then upgraded to 8.2 around 6
  months later.

  Where I work now we are looking at migrating from 8.1 to 8.2 or 8.3
  (depending on whether or not we have the man power to fix a few issues
  with type coercion, our app, and 8.3)  These aren't the DBA got a
  wild hair and just had to upgrade upgrades.  Each time I've migrated
  it's been because there were performance or maintenance issues that
  were solved by upgrading.

Perhaps I'm in a unique situation as well, but as the DBA of a
data-tank style DB, I haven't had a problem at all finding
opportunities to upgrade to later versions of postgresql. My schema
isn't all that complicated; it's just a very large amount of data and
some very complex queries on that data- but the queries have been kept
to extremely standard SQL specifically for migration and
cross-platform reasons. It's definitely been annoying on occasion to
find that I need to do a dump and restore to move to a new version,
but at the same time bulk storage is extremely inexpensive compared
to the sort of storage acceptable for day-to-day use, so
size isn't generally a problem- just dump to a big, cheap disk and
then restore. I'm probably lucky in that I manage a shop that can
tolerate a day's downtime for such a situation, but at the same time,
we also demand the most from database performance for complex queries,
so a day's downtime here could easily save many days' worth of query
time down the line.

8.3, FWIW, was particularly attractive in this regard. I couldn't
quite justify upgrading to the release candidates, but the performance
improvements were pretty tempting.
-- 
- David T. Wilson
[EMAIL PROTECTED]



Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-24 Thread David Wilson
Thanks for the help from everyone on this. Further investigation with
the suggested statistics, correlated with some IO graphs, pretty much
nailed the problem down to checkpoint IO holding things up, and tuning
the checkpoint segments and completion target (128 and 0.9 seemed to
be the best tradeoff for me) cleared things up.

All told, the run time of this regeneration pass was less than half of
what it was the last time I ran one, despite involving more total
data. Much appreciated.
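
For anyone hitting this thread later: those two knobs are the standard
GUCs in postgresql.conf, and the values below are simply the ones
mentioned above, not a general recommendation:

checkpoint_segments = 128
checkpoint_completion_target = 0.9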

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] table as log (multiple writers and readers)

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 9:52 AM, Joris Dobbelsteen
[EMAIL PROTECTED] wrote:

  Describe the mechanism, because I don't really believe it yet. I think you
 need to do an advisory lock around every commit of every transaction that
 writes to the log table.

Consider some number of reader processes and some number of writer processes.

Writer processes touch only the staging table, and solely do inserts
into it. As a result, writer processes cannot interfere with each
other in any way and do not require any synchronization beyond that
provided by MVCC.

Reader processes are interested in polling the logging table at
intervals. In the process, they also act as staging-to-log movers.
This act (because it is destructive and because we require serialized
inserts for id generation in the log table) must take a lock that
prevents other readers from attempting the same work at the same time.

Each reader process therefore has a loop that appears as follows:
1) Obtain advisory lock.
2) Begin transaction.
3) For each row in staging table, insert copy into log table.
4) Delete all visible rows from staging table.
5) Commit transaction.
6) Release advisory lock.
7) Handle not-yet-seen rows in the logging table (This is the primary
work of the readers)
8) Sleep for the desired interval and return to 1).

We require two types of synchronization and the above takes care of both:
1) The advisory lock prevents multiple readers from doing simultaneous
staging-to-log moves.
2) The transaction block ensures that the reader will see a consistent
state on the staging table while writers may write at the same time;
writes that occur during the reader's transaction block will simply be
ignored during this round of reading.

You need both types of synchronization to avoid problems- taking an
exclusive lock would simply be the sledgehammer method of doing the
synchronization, since it would take the place of both the advisory
lock and the transaction at the same time but would also block
writers.
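
To make that concrete, here is a minimal SQL sketch of the reader loop. The
table and column names (staging, log, payload) and the advisory lock key are
made up for illustration; only the shape of the loop comes from the steps
above:

SELECT pg_advisory_lock(12345);     -- step 1: take the reader lock
BEGIN;                              -- step 2
INSERT INTO log (payload)           -- step 3: copy staging rows into the log
    SELECT payload FROM staging;
DELETE FROM staging;                -- step 4 (see the follow-up below about
                                    -- read-committed visibility here)
COMMIT;                             -- step 5
SELECT pg_advisory_unlock(12345);   -- step 6
-- steps 7 and 8 (process the new log rows, sleep) happen in the client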

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] table as log (multiple writers and readers)

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 2:48 PM, Joris Dobbelsteen
[EMAIL PROTECTED] wrote:

  Ah, yes, "all visible rows"...
  My point is that, unless you use a transaction with serializable isolation,
 this "all visible rows" for the second statement might be different from those
 that you copied into the log table.

  With the normal Read committed isolation level you suffer from a possible
 nonrepeatable read that might change tuple visibility between different
 statements.

That depends on the implementation. Either a select into ... to do the
initial copy, followed by a delete where ... with the where clause
referencing the log table itself to ensure that we delete only things
that now exist in the log table, or a row-by-row insert/delete pair,
would provide the appropriate level of protection against accidentally
deleting more than you intended, without harming concurrency.
The delete referencing the log table might require that the log table
be indexed for performance, but it's likely that such indexing would
be done anyway for general log use.
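
As a rough sketch of the copy-then-delete variant (table names hypothetical,
and assuming each staging row carries a unique id that is copied into the
log so the delete can check for it):

BEGIN;
INSERT INTO log (staging_id, payload)
    SELECT id, payload FROM staging;
DELETE FROM staging s
 WHERE EXISTS (SELECT 1 FROM log l WHERE l.staging_id = s.id);
COMMIT;

Under read committed, any staging row committed after the INSERT takes its
snapshot will not yet be in the log, so the DELETE leaves it alone for the
next pass.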
-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
I have a fairly simple table (a dozen real/integer columns, a few
indexes, one foreign key reference) with ~120m rows. Periodically the
table is truncated or dropped and recreated and the data is
regenerated (slightly different data, of course, or the exercise would
be rather pointless). The regeneration occurs in batches of ~4000 data
points at a time, which are inserted into the table via COPY, and are
coming from several simultaneous processes.

The first several million data points are quite quick (the copy
executes in well under a quarter second). By the time the table
reaches 10-15m rows, however, each individual COPY is taking 20
seconds to execute. Is there anything I can do to improve this
performance? I can't drop/recreate the indices because some of the
data points rely on points generated already in the run, and dropping
the indices would make the required joins ridiculously expensive once
the table starts growing. The foreign key reference *is* droppable for
this regeneration, but I wouldn't expect it to be a performance
problem.

The autovacuum daemon is running in the background, with these
settings: (All autovacuum-specific settings are still at defaults)
vacuum_cost_delay = 50  # 0-1000 milliseconds
vacuum_cost_page_hit = 1    # 0-10000 credits
vacuum_cost_page_miss = 10  # 0-10000 credits
vacuum_cost_page_dirty = 20 # 0-10000 credits
vacuum_cost_limit = 200     # 1-10000 credits

My gut feeling is that better autovacuum settings would help, but I'm
not really sure what to modify to get the improvement I'm looking for.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 4:38 PM, Scott Marlowe [EMAIL PROTECTED] wrote:
  The best bet is to issue an analyze table (with your table name in
  there, of course) and see if that helps.  Quite often the real issue
  is that pgsql is using a method to insert rows when you have 10million
  of them that made perfect sense when you had 100 rows, but no longer
  is the best way.


This has caused the behavior to be... erratic. That is, individual
copies are now taking anywhere from 2 seconds (great!) to 30+ seconds
(back where we were before). I also clearly can't ANALYZE the table
after every 4k batch; even if that resulted in 2 second copies, the
analyze would take up as much time as the copy otherwise would have
been. I could conceivably analyze after every ~80k (the next larger
unit of batching; I'd love to be able to batch the copies at that
level but dependencies ensure that I can't), but it seems odd to have
to analyze so often.

Oh, barring COPY delays I'm generating the data at a rate of something
like a half million rows every few minutes, if that's relevant.
-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 5:04 PM, Scott Marlowe [EMAIL PROTECTED] wrote:
  Normally, after the first 50,000 or so the plan won't likely change
  due to a new analyze, so you could probably just analyze after 50k or
  so and get the same performance.  If the problem is a bad plan for the
  inserts / copies.

  also, non-indexed foreign keyed fields can cause this problem.


Analyzing after the first 50k or so is easy enough, then; thanks for
the suggestion.

Foreign keys are definitely indexed (actually referencing a set of
columns that the foreign table is UNIQUE on).

Any other suggestions? COPY times alone are pretty much quadrupling my
table-rebuild runtime, and I can interrupt the current rebuild to try
things pretty much at a whim (nothing else uses the DB while a rebuild
is happening), so I'm pretty much game to try any reasonable
suggestions anyone has.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 5:18 PM, Scott Marlowe [EMAIL PROTECTED] wrote:

  Try upping your checkpoint segments.  Some folks find fairly large
  numbers like 50 to 100 to be helpful.  Each segment = 16Megs, so be
  sure not to run your system out of drive space while increasing it.


Ahh, much more progress. Upping the segments to 50, timeout to 30m and
completion target to 0.9 has improved average copy time to between 2
and 10 seconds, which is definitely an improvement. Thanks for the
help. Any other random thoughts while you're at it? :)

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 6:10 PM, Tom Lane [EMAIL PROTECTED] wrote:
 David Wilson [EMAIL PROTECTED] writes:

  Are you loading any tables that are the targets of foreign key
  references from other tables being loaded?  If so, I'd bet on
  Scott's theory being correct with respect to the plans for checks
  of those FK constraints.  The COPY itself hasn't got any plan,
  and inserting rows into a table should be constant-time in itself,
  so it seems to me there are only two possibilities for a progressive
  slowdown:

  * the cost of updating the indexes, which for standard btree indexes
  ought to grow at about log(N) when there are already N entries

  * bad plans in either foreign-key triggers or user-defined triggers
  attached to the tables.

Only one table is being regenerated, and it's not the target of any
foreign key checks itself; it merely has a single FK reference out to
one unchanging table. There are no triggers on the table.

  You failed to mention what PG version this is (tut tut) but if it's
  less than 8.3 then ANALYZE alone won't fix bad plans in triggers;
  you'd need to analyze and then start a fresh database session.

PG is 8.3.1.

I certainly expect some slowdown, given that I have indices that I
can't drop (as you indicate above). Having been watching it now for a
bit, I believe that the checkpoint settings were the major cause of
the problem, however. Changing those settings has dropped the copy
times back down toward what I'd expect; I have also now dropped the FK
constraint, but that has made no perceptible difference in time.

My guess at this point is that I'm just running into index update
times and checkpoint IO. The only thing that still seems strange is
the highly variable nature of the COPY times- anywhere from 1.0
seconds to 20 seconds, with an average probably around 8ish. I can
live with that, but I'm still open to any other suggestions anyone
has!

Thanks for the help so far.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Rapidly decaying performance repopulating a large table

2008-04-22 Thread David Wilson
On Tue, Apr 22, 2008 at 7:33 PM, Tom Lane [EMAIL PROTECTED] wrote:

  What have you got shared_buffers set to?  If it's not enough to cover
  the working set for your indexes, that might be the (other) problem.


shared_buffers = 1536MB

Is there a way to get the size of a specific index, on that note?
There seem to be access functions for the relation + indices, and for
the relation by itself, but not a specific index out of possibly
several. I could increase shared_buffers some, but client apps on the
same machine occasionally also have hefty memory requirements (not
during these regeneration runs, but it seems like restarting the
server with a new shared_buffers value before and after the
regeneration is a bit of overkill).
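
(Partially answering my own question: an index is itself a relation, so as
far as I know you can pass its name straight to pg_relation_size, e.g.

SELECT pg_size_pretty(pg_relation_size('val_datestamp_idx'));

which should report the size of just that one index.)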

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] table as log (multiple writers and readers)

2008-04-21 Thread David Wilson
On Mon, Apr 21, 2008 at 7:55 PM, Joris Dobbelsteen
[EMAIL PROTECTED] wrote:

  If you want to clean up the the staging table I have some concerns about
 the advisory lock. I think you mean exclusive table lock.

Either works, really. An advisory lock is really just a lock over
which you have control of the meaning, as long as you're using it in
the appropriate places. Also, an advisory lock on just the processes
doing staging-to-log moves would allow writes into the staging table
to continue concurrently with the staging-to-log transaction (whereas
an exclusive lock would unnecessarily prevent them).

Also, while Vance appears to have chosen to have a dedicated
staging-to-log process, even that isn't necessary- each reader can
simply do the lock/clear staging/unlock before any attempt to read-
unless you're polling that log table at truly crazy rates, the
overhead should be negligible and will ensure that the staging table
is simply cleared out whenever necessary while removing the
complexity of a separate process.


-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Using meta-data for foreign key?

2008-04-17 Thread David Wilson
On Thu, Apr 17, 2008 at 1:55 PM, Roberts, Jon [EMAIL PROTECTED] wrote:
  Using the term metadata is misleading in your example.  Metadata is
  data about data and in your example, you just have data.  It is fine to
  join tables in an RDBMS.


I believe you missed the OP's actual goal. He's not interested in
linking printing options to specific rows of the foo table; he wants
each column of the foo table to have a single entry in the printing
options table; that is, he wants a foreign key reference to the system
catalog giving the columns of entity foo. He does, in fact, appear to
be interested in a foreign key reference to a table's metadata.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] table as log (multiple writers and readers)

2008-04-16 Thread David Wilson
(I originally missed replying to all here; sorry about the duplicate,
Vance, but figured others might be interested.)

On Wed, Apr 16, 2008 at 1:55 PM, Vance Maverick [EMAIL PROTECTED] wrote:

  Another approach would be to queue the log entries in a staging table,
  so that a single process could move them into the log.  This is fairly
  heavyweight, but it would guarantee the consistent sequencing of the log
  as seen by a reader (even if the order of entries in the log didn't
  always reflect the true commit sequence in the staging table).  I'm
  hoping someone knows a cleverer trick.


Consider a loop like the following

advisory lock staging table
if (entries in table)
   copy entries to main log table as a single transaction
release advisory lock on staging table
read out and handle most recent log entries from main table

The advisory lock is automatically released on client disconnect, and
doing the whole thing within one transaction should prevent any
partial-copies on failures.

It doesn't matter that there are concurrent inserts to the staging
table because the staging table is always wiped all at once and
transferred in a synchronous fashion to the main table. You also can't
lose data, because it's always in one of the two tables.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Unacceptable postgres performance vs. Microsoft sqlserver

2008-04-14 Thread David Wilson
On Mon, Apr 14, 2008 at 1:34 PM, Chris Browne [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] [EMAIL PROTECTED] writes:
   Another test.  In postgres I added an index to the userid column and
   then counted distinct userids.  The average run time over three
   queries was 4666 seconds, or 78 minutes.  Unbelievable.
  
   On SQL Server, with *no* index, the same query takes on average 414
   seconds, or about 7 minutes.  Ten times faster!

First, in general- use the EXPLAIN and EXPLAIN ANALYZE feature of
postgresql. That will tell you a lot about what your queries are doing
and why they're taking so long.

Second, make sure you've ANALYZE'd your table after creating it and
the index, which gives the planner the statistics necessary to make
intelligent choices.

For instance, your count of distinct userids is probably not using the
index you just created. If it still isn't using it after you ANALYZE
the table, try rewriting the query using group by (select count(*)
from (select userid from mytable group by userid) tmp). I recently had
a similar performance issue on a 75m row table, and the above helped.
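
Spelled out, using the mytable/userid names from your message, that rewrite
is just:

EXPLAIN ANALYZE
SELECT count(*) FROM (SELECT userid FROM mytable GROUP BY userid) tmp;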

VACUUM ANALYZE tables, and then remember that EXPLAIN and EXPLAIN
ANALYZE are your best friends.

-- 
- David T. Wilson
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem after VACUUM ANALYZE

2008-04-09 Thread David Wilson
On Wed, Apr 9, 2008 at 3:29 AM,  [EMAIL PROTECTED] wrote:

  But if i do VACUUM ANALYZE without concurrent queries, everything runs fine
  afterwards.

  If i run  VACUUM ANALYZE with few concurrent queries, it slows down to a
  crawl.

  Could it be that something like this is happening:
  - In the early morning a new DB connection is opened.
  - While VACUUM ANALYZE is running, the planner uses different plans because
  some index could not be used or the statistics are, right at that moment, not
  present because they are being updated... So the query gets a wrong plan: a
  seqScan instead of an index scan.
  - This wrongly planned statement is prepared, so even after VACUUM ANALYZE is
  done, the statement keeps using the wrong plan.
  - Load rises and triggers many concurrent queries with wrong plans, so
  everything slows down.


I've run into a very similar problem. I have some long-running
processes that generate a large amount of data and then query that
data that must periodically drop their connections and rebuild to
ensure that query plans get re-planned according to updated
statistics. This is especially true when a new DB is first being
brought up with an initial set of data (~1 week of compute time, ~30gb
of data and ~120m rows) with processes that live through the entire
process.

My solution, as mentioned above, is to rebuild the connection approx
every hour on my long-running processes. This is a tricky question,
because you want the benefit of not re-planning queries in 99.9% of
the cases- I'm not really opposed to the current system that requires
the reconnections, given that the overhead involved in them is
completely negligible in the grand scheme of things. There are
downsides to not replanning queries, and if you can't live with them
then you should either force re-planning at intervals or avoid
prepared statements.

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] select distinct and index usage

2008-04-08 Thread David Wilson
On Mon, Apr 7, 2008 at 9:11 PM, Stephen Denne
[EMAIL PROTECTED] wrote:

  You may be able to make use of an index by rearranging your query to 
 generate a series between your min & max values, testing whether each value 
 is in your table.

  You've got 4252 distinct values, but what is the range of max - min? Say 
 it's 5000 values, you'd do 5000 lookups via an index, unless postgres thought 
 that the number of index based lookups were going to be more expensive than 
 reading the entire table.

Upon further investigation, the above works very well:

explain analyze select ds from (select generate_series((select
datestamp from vals order by datestamp asc limit 1), (select datestamp
from vals order by datestamp desc limit 1), 86400) as ds) series where
exists (select datestamp from vals where datestamp=ds);

 QUERY PLAN
--
 Subquery Scan series  (cost=4.89..8.90 rows=1 width=4) (actual
time=0.080..25264.239 rows=4252 loops=1)
   Filter: (subplan)
   ->  Result  (cost=4.89..4.90 rows=1 width=0) (actual
time=0.051..7.491 rows=6163 loops=1)
 InitPlan
   ->  Limit  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.023..0.024 rows=1 loops=1)
 ->  Index Scan using val_datestamp_idx on vals
(cost=0.00..184401441.14 rows=75391472 width=4) (actual
time=0.021..0.021 rows=1 loops=1)
   ->  Limit  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.020..0.020 rows=1 loops=1)
 ->  Index Scan Backward using val_datestamp_idx on
validations  (cost=0.00..184401441.14 rows=75391472 width=4) (actual
time=0.018..0.018 rows=1 loops=1)
   SubPlan
 ->  Index Scan using val_datestamp_idx on validations
(cost=0.00..70453.21 rows=17685 width=4) (actual time=4.096..4.096
rows=1 loops=6163)
   Index Cond: (datestamp = $0)
 Total runtime: 25267.033 ms
(12 rows)

The series generates all the possible datestamps + about 40% extra.

What's particularly interesting here to me is that it at least seems
like this validates my original assumption that if the planner could
be coaxed into using the index it would be faster- or am I missing
something? This query, at 25 seconds, was certainly much faster than
even the GROUP BY version that ran in 120 seconds.

As before, thanks for all of the information and ideas. Down from 722
seconds to 25 seconds is a hefty improvement.

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] select distinct and index usage

2008-04-07 Thread David Wilson
On Mon, Apr 7, 2008 at 2:05 AM, Alban Hertroys
[EMAIL PROTECTED] wrote:
 On Apr 7, 2008, at 1:32 AM, David Wilson wrote:

 

  The database's estimates seem consistent with yours, so why is it doing
 this? Could you provide an EXPLAIN ANALYSE? It shows the actual numbers next
 to the estimates, although I figure that query might take a while...

explain analyze select distinct datestamp from vals;
   QUERY
PLAN
-
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4) (actual
time=649599.159..721671.595 rows=4252 loops=1)
   ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
(actual time=649599.157..694392.602 rows=75391476 loops=1)
 Sort Key: datestamp
 Sort Method:  external merge  Disk: 1178592kB
 ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472
width=4) (actual time=9.104..93130.468 rows=75391476 loops=1)
 Total runtime: 722379.434 ms

(There were a couple other very long-running, disk-intensive queries
going on in the background of this, so that runtime is a little
inflated, but the values should still all be relevant.)

  Pg estimates the costs quite high too. It's almost as if there isn't an
 index on that column and it has no other way than doing a sequential scan...
 Could you show us the table definition and its indexes? What version of Pg
 is this?

Pg is 8.3.1

Table definition:
CREATE TABLE vals (
sid integer NOT NULL,
eid integer NOT NULL,
datestamp integer NOT NULL,
val_dur integer NOT NULL,
acc real NOT NULL,
yld real NOT NULL,
rt real NOT NULL,
ydev real NOT NULL,
vydev real NOT NULL,
adev real NOT NULL,
achange real NOT NULL,
ychange real NOT NULL,
arsi real NOT NULL,
yrsi real NOT NULL,
UNIQUE (sid,eid,val_dur,datestamp),
FOREIGN KEY (sid,eid,datestamp) REFERENCES preds
(sid,eid,datestamp) ON DELETE CASCADE
);
create index val_datestamp_idx on vals(datestamp);
create index val_evaluator_idx on vals(eid);
create index val_search_key on vals(val_dur,eid,datestamp);
create index val_vd_idx on vals(val_dur,datestamp);

(The various indices are for a variety of common queries into the table)

  It may be that your index on vals.datestamp doesn't fit into memory; what
 are the relevant configuration parameters for your database?

That's a very good question. I recently had to rebuild this particular
database and haven't played with the configuration parameters as much
as I'd like- what parameters would be most relevant here? I hadn't
realized that an index needed to fit into memory.

pg_total_relation_size('vals') - pg_relation_size('vals') gives 11gb.
All indexed columns are integers. My guess is that this means that
it's likely the index doesn't fit into memory.

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] select distinct and index usage

2008-04-07 Thread David Wilson
On Mon, Apr 7, 2008 at 1:42 PM, Alban Hertroys
[EMAIL PROTECTED] wrote:

  Have you tried this query with enable_seqscan=off? If my guess is right
 (and the planners, in that case) it'd be even slower.

set enable_seqscan=off;
explain select distinct datestamp from vals;
  QUERY PLAN
--
 Unique  (cost=115003047.47..115380004.83 rows=4263 width=4)
   ->  Sort  (cost=115003047.47..115191526.15 rows=75391472 width=4)
 Sort Key: datestamp
 ->  Seq Scan on vals  (cost=100000000.00..101531261.72
rows=75391472 width=4)

It appears to be doing a sequential scan regardless of the setting, as if
it doesn't believe it can use the index for some reason.

  Something that might help you, but I'm not sure whether it might hurt the
 performance of other queries, is to cluster that table on val_datestamp_idx.
 That way the records are already (mostly) sorted on disk in the order of the
 datestamps, which seems to be the brunt of the above query plan.

That's a good thought. I'll give that a try this evening when the DB
has some downtime and see what happens.
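
For reference, with 8.3 syntax that would be something along the lines of:

CLUSTER vals USING val_datestamp_idx;

bearing in mind that CLUSTER rewrites the whole table and holds an exclusive
lock while it runs, so it really does need that downtime window.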


  There seems to be quite a bit of overlap in your index definitions. From my
 experience this can confuse the planner.

  I suggest you combine them, but not knowing your data... Maybe rewriting
 your UNIQUE constraint to (val_dur, datestamp, eid, sid) would be enough to
 replace all those other indexes.
  If not, it's probably better to have one index per column, so that the
 planner is free to combine them as it sees fit. That'd result in a bitmap
 index scan, btw.

I can take a look at the other indices again, but those are all in
place for specific other queries that generally involve some set of
a=1, b=2, c=3, datestamp > 5 type of where-clause and were created
specifically in response to sequential scans showing up in other
queries (and had the proper effect of fixing them!)

  I'm not a postgresql tuning expert (I know my way around though), other
 people can explain you way better than I can. Bruce Momjian for example:
 http://www.linuxjournal.com/article/4791

I'll take a look at that, thanks.

  That calculation doesn't look familiar to me, I'm more used to:
   select pg_size_pretty(pg_relation_size('...'));

  You can put the name of any relation in there, be it tables, indexes, etc.

  11GB is pretty large for an index on an integer column, especially with
 only 75M rows: that's 146 bytes/row in your index. Maybe your index got
 bloated somehow? I think it should be about a tenth of that.

pg_total_relation_size('..') gives the number of bytes for the table +
all associated indices; pg_relation_size('..') gives for just the
table. The difference between the two should be total bytes take up by
the 5 total indices (11 total index cols), giving a
back-of-the-envelope estimation of 1gb for the size of the datestamp
index. I am fairly certain that I didn't give pg 1gb to fit the index
in memory, so I'll try upping its total available memory tonight and
see if that doesn't improve things.

I appreciate the responses so far! I'm used to several minutes for
some of the complex queries on this DB, but 12.5 minutes for a select
distinct just seems wrong. :)

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] select distinct and index usage

2008-04-07 Thread David Wilson
On Mon, Apr 7, 2008 at 7:57 PM, Gregory Stark [EMAIL PROTECTED] wrote:

  You could try changing it to the equivalent GROUP BY query. The planner,
  unfortunately, doesn't know they're equivalent and has two separate sets of
  plans available. In this case where there are only 4,000 distinct values out
  of 75M original records you might find a HashAggregate plan, which the 
 planner
  doesn't know can be used for DISTINCT, best. You might have to raise work_mem
  before the planner feels a hash will fit.

  --
   Gregory Stark
   EnterpriseDB  http://www.enterprisedb.com
   Ask me about EnterpriseDB's RemoteDBA services!


Progress!

explain analyze select datestamp from vals group by datestamp;
 QUERY
PLAN

 HashAggregate  (cost=1719740.40..1719783.03 rows=4263 width=4)
(actual time=120192.018..120193.930 rows=4252 loops=1)
   ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472 width=4)
(actual time=17.441..66807.429 rows=75391476 loops=1)
 Total runtime: 120195.144 ms


Compared with:

explain analyze select distinct datestamp from vals;
  QUERY
PLAN
-
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4) (actual
time=649599.159..721671.595 rows=4252 loops=1)
   ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
(actual time=649599.157..694392.602 rows=75391476 loops=1)
 Sort Key: datestamp
 Sort Method:  external merge  Disk: 1178592kB
 ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472
width=4) (actual time=9.104..93130.468 rows=75391476 loops=1)
 Total runtime: 722379.434 ms

Still doing the sequential scan on the table, but at least it's
avoiding the expensive disk merge sort. It still seems as if I ought
to be able to coax it into using an index for this type of query,
though- especially since it's using one on the other table. Is there
perhaps some way to reformulate the index in such a way as to make it
more useful to the planner?

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] select distinct and index usage

2008-04-06 Thread David Wilson
I have a reasonably large table (~75m rows,~18gb) called vals. It
includes an integer datestamp column with approximately 4000 unique
entries across the rows; there is a normal btree index on the
datestamp column. When I attempt something like select distinct
datestamp from vals, however, explain tells me it's doing a
sequential scan:

explain select distinct datestamp from vals;
  QUERY PLAN
--
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4)
   ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
 Sort Key: datestamp
 ->  Seq Scan on vals v  (cost=0.00..1531261.72 rows=75391472 width=4)

On another table in the same database with a much smaller number of
total rows (~15m rows), I have the exact same situation- but in this
case the index on the datestamp column *is* used:

explain select distinct datestamp from sdays;
 QUERY PLAN
-
 Unique  (cost=0.00..974675.99 rows=4254 width=4)
   ->  Index Scan using sdays_datestamp_idx on sdays
(cost=0.00..932822.79 rows=16741280 width=4)

Any help on why the index isn't being used, or how I can set up the
index/query to make use of the index rather than doing an 18gb
sequential scan, would be very much appreciated.

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] select distinct and index usage

2008-04-06 Thread David Wilson
Both tables are vacuumed and analyzed. I have other queries that are
using various indices on the vals table in an intelligent fashion. I
can try increasing the stats, certainly, although they're at the
defaults for both tables.

The variation is definitely identical- the set of datestamps in the
vals table is a large (~98%) subset of the set of datestamps in the
sdays table. Of the approximately 4000 distinct datestamps, there are
80 that appear in the sdays table and not in the vals table.

On Sun, Apr 6, 2008 at 10:20 PM, Craig Ringer
[EMAIL PROTECTED] wrote:
 David Wilson wrote:

  On another table in the same database with a much smaller number of
  total rows (~15m rows), I have the exact same situation- but in this
  case the index on the datestamp column *is* used:
 
 
  Have you run ANALYZE on both tables?

  It might be worth increasing the stats collected on the datestamp column
 (or all columns in the table), re-running ANALYZE on the table, and seeing
 what happens.

  Also, is there any chance that there's a lot more variation in datestamp
 values in your problem table than the one where the index is used?

  --
  Craig Ringer




-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general