Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mark Cotner
See comments . . . thanks for the feedback.

'njoy,
Mark

--- Christopher Browne <[EMAIL PROTECTED]> wrote:

> The world rejoiced as Mischa Sandberg
> <[EMAIL PROTECTED]> wrote:
> > Mark Cotner wrote:
> >> Requirements:
> >> Merge table definition equivalent.  We use these
> >> extensively.
> 
> > Looked all over mysql.com etc, and afaics merge
> table is indeed
> > exactly a view of a union-all. Is that right?
> 
> > PG supports views, of course, as well (now) as
> tablespaces, allowing
> > you to split tables/tablesets across multiple disk
> systems.  PG is
> > also pretty efficient in query plans on such
> views, where (say) you
> > make one column a constant (identifier, sort of)
> per input table.
> 
> The thing that _doesn't_ work well with these sorts
> of UNION views are
> when you do self-joins.  Supposing you have 10
> members, a self-join
> leads to a 100-way join, which is not particularly
> pretty.
> 
> I'm quite curious as to how MySQL(tm) copes with
> this, although it may
> not be able to take place; they may not support
> that...
> 
> >> Um, gonna sound silly, but the web interface has
> to remain "snappy"
> >> under load.  I don't see this as a major concern
> since you don't
> >> require table locking.
> 
> > Agreed. It's more in your warehouse design, and
> intelligent bounding
> > of queries. I'd say PG's query analyzer is a few
> years ahead of
> > MySQL for large and complex queries.
> 
> The challenge comes in if the application has had
> enormous amounts of
> effort put into it to attune it exactly to
> MySQL(tm)'s feature set.
> 
> The guys working on RT/3 have found this a
> challenge; they had rather
> a lot of dependencies on its case-insensitive string
> comparisons,
> causing considerable grief.
>

Not so much; I've tried to be as agnostic as possible.
Much of the more advanced mining that I've written is
kinda MySQL-specific, but needs to be rewritten as
stored procedures anyway.
 
> > On the other hand, if you do warehouse-style
> loading (Insert, or PG
> > COPY, into a temp table; and then 'upsert' into
> the perm table), I
> > can guarantee 2500 inserts/sec is no problem.
> 
> The big wins are thus:
> 
>  1.  Group plenty of INSERTs into a single
> transaction.
> 
>  2.  Better still, use COPY to cut parsing costs
> plenty more.
> 
>  3.  Adding indexes _after_ the COPY are a further
> win.
> 
> Another possibility is to do clever things with
> stored procs; load
> incoming data using the above optimizations, and
> then run stored
> procedures to use some more or less fancy logic to
> put the data where
> it's ultimately supposed to be.  Having the logic
> running inside the
> engine is the big optimization.

Agreed, I did some preliminary testing today and am
very impressed.  I wasn't used to running analyze
after a data load, but once I did that everything was
snappy.

My best results from MySQL bulk inserts were around 36k
rows per second on a fairly wide table.  Today I got
42k using the COPY command, and with the analyze run
post-insert the results were similar.  These are excellent
numbers.  It basically means we could have our
cake (great features) and eat it too (performance that's
good enough to run the app).
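
For reference, the load pattern being tested boils down to something
like this (the table, file, and column names here are hypothetical):

BEGIN;
CREATE TEMP TABLE stage_load (LIKE target_table);    -- staging copy of a hypothetical target table
COPY stage_load FROM '/tmp/load_file.dat';           -- one parse pass, no per-row INSERT overhead
INSERT INTO target_table SELECT * FROM stage_load;   -- 'upsert' logic would go here instead
COMMIT;
CREATE INDEX target_table_ts_idx ON target_table (event_time);  -- build indexes after the load
ANALYZE target_table;                                -- refresh planner statistics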

Queries from my test views were equally pleasing.  I
won't bore you with the details just yet, but
PostgreSQL is doing great.   Not that you all are
surprised.  ;)


> -- 
> wm(X,Y):-write(X),write('@'),write(Y).
> wm('cbbrowne','linuxfinances.info').
> http://linuxfinances.info/info/spreadsheets.html
> Rules  of  the  Evil  Overlord   #198.   "I  will 
> remember  that  any
> vulnerabilities I have  are to be revealed strictly 
> on a need-to-know
> basis.  I will also remember that no one needs to
> know."


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Mark Cotner) wrote:
> Agreed, I did some preliminary testing today and am very impressed.
> I wasn't used to running analyze after a data load, but once I did
> that everything was snappy.

Something worth observing is that this is true for _any_ of the
database systems supporting a "cost-based" optimization system,
including Oracle and DB2.

When working with SAP R/3 Payroll, on one project, we found that when
the system was empty of data, the first few employee creates were
quick enough, but it almost immediately got excruciatingly slow.  One
of the DBAs told the Oracle instance underneath to collect statistics
on the main table, and things _immediately_ got snappy again.  But it
didn't get snappy until the conversion folk had run the conversion
process for several minutes, to the point where it had become
painfully slow :-(.  There, with MILLIONS of dollars worth of license
fees being paid, across the various vendors, it still took a fair bit
of manual fiddling.

MySQL(tm) is just starting to get into cost-based optimization; in
that area, they're moving from where the "big DBs" were about 10 years
ago.  It was either version 7 or 8 where Oracle started moving to
cost-based optimization, and (as with the anecdote above) it took a
release or two for people to get accustomed to the need to 'feed' the
optimizer with statistics.  This is a "growing pain" that bites users
with any database where this optimization gets introduced.  It's
worthwhile, but is certainly not costless.

I expect some foreseeable surprises will be forthcoming for MySQL AB's
customers in this regard...

> My best results from MySQL bulk inserts was around 36k rows per
> second on a fairly wide table.  Today I got 42k using the COPY
> command, but with the analyze post insert the results were similar.
> These are excellent numbers.  It basically means we could have our
> cake(great features) and eat it too(performance that's good enough
> to run the app).

In the end, performance for inserts is always fundamentally based on
how much disk I/O there is, and so it should come as no shock that
when roughly the same amount of data is getting laid down on disk,
performance won't differ much on these sorts of essentials.

There are a few places where there's some need for cleverness; if you
see particular queries running unusually slowly, it's worth doing an
EXPLAIN or EXPLAIN ANALYZE on them, to see how the query plans are
being generated.  There's some collected wisdom out here on how to
encourage the right plans.

There are also unexpected results that are OK.  We did a system
upgrade a few days ago that led to one of the tables starting out
totally empty.  A summary report that looks at that table wound up
with a pretty wacky looking query plan (compared to what's usual)
because the postmaster knew that the query would be reading in
essentially the entire table.  You'd normally expect an index scan,
looking for data for particular dates.  In this case, it did a "scan
the whole table; filter out a few irrelevant entries" plan.  

It looked wacky, compared to what's usual, but it ran in about 2
seconds, which was way FASTER than what's usual.  So the plan was
exactly the right one.

Telling the difference between the right plan and a poor one is a bit
of an art; we quite regularly take a look at query plans on this list
to figure out what might be not quite right.  If you find slow ones,
make sure you have run ANALYZE on the tables recently, to be sure that
the plans are sane, and you may want to consider posting some of them
to see if others can point to improvements that can be made.
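
In concrete terms, that checklist amounts to something like the
following (the table and column names are hypothetical):

ANALYZE my_big_table;              -- make sure the planner has fresh statistics
EXPLAIN ANALYZE
SELECT count(*) FROM my_big_table
WHERE some_date >= '2004-09-01';   -- then compare estimated vs. actual row counts in the plan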
-- 
If this was helpful,  rate me
http://linuxfinances.info/info/linuxdistributions.html
"I can't believe my room doesn't have Ethernet!  Why wasn't it wired
when the house was built?"
"The house was built in 1576." 
-- Alex Kamilewicz on the Oxford breed of `conference American.'



[PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Damien Dougan

Hi All,

I am having a performance problem extracting a large volume of data from
Postgres 7.4.2, and was wondering if there was a more cunning way to get
the data out of the DB...

This isn't a performance problem with any particular PgSQL operation;
it's more a question of strategy for getting large volumes of related
tables out of the DB whilst preserving the relations between them.


Basically we have a number of tables, which are exposed as 2 public
views (say PvA and PvB). For each row in PvA, there are a number of
related rows in PvB (this number is arbitrary, which is one of the
reasons why it can't be expressed as additional columns in PvA - so we
really need 2 sets of tables - which leads to two sets of extract calls
- interwoven to associate PvA with PvB).


The problem is that for extraction, we ultimately want to grab a row
from PvA, and then all the related rows from PvB and store them together
offline (e.g. in XML).

However, the number of rows at any time on the DB is likely to be in the
millions, with perhaps 25% of them being suitable for extraction at any
given batch run (i.e. several hundred thousand to several million).


Currently, the proposal is to grab several hundred rows from PvA (thus
avoiding issues with the resultset being very large), and then process
each of them by selecting the related rows in PvB (again, several
hundred rows at a time to avoid problems with large result sets).

So the algorithm is basically:


Do
    Select the next 200 rows from PvA

    For each PvA row Do
        Write current PvA row as XML

        Do
            Select the next 200 rows from PvB

            For each PvB row Do
                Write current PvB row as XML
                within the parent PvA XML Element
            End For
        While More Rows
    End For
While More Rows


However, this has a fairly significant performance impact, and I was
wondering if there was a faster way to do it (something like taking a
dump of the tables so they can be worked on offline - but a basic dump
means we have lost the 1:M relationships between PvA and PvB).


Are there any tools/tricks/tips with regards to extracting large volumes
of data across related tables from Postgres? It doesn't have to export
into XML, we can do post-processing on the extracted data as needed -
the important thing is to keep the relationship between PvA and PvB on a
row-by-row basis.


Many thanks,

Damien




Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Paul Thomas
On 13/09/2004 12:38 Damien Dougan wrote:
[snip]
Are there any tools/tricks/tips with regards to extracting large volumes
of data across related tables from Postgres? It doesn't have to export
into XML, we can do post-processing on the extracted data as needed -
the important thing is to keep the relationship between PvA and PvB on a
row-by-row basis.
Have you considered using cursors?
--
Paul Thomas
+------------------------------+--------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for Business            |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk  |
+------------------------------+--------------------------------------------+


Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Pierre-Frédéric Caillaud

There's a very simple solution using cursors.
As an example :
create table categories ( id serial primary key, name text );
create table items ( id serial primary key, cat_id integer references categories(id), name text );
create index items_cat_idx on items( cat_id );

insert stuff...
select * from categories;
 id |   name
----+----------
  1 | tools
  2 | supplies
  3 | food
(3 rows)

select * from items;
 id | cat_id |     name
----+--------+--------------
  1 |      1 | hammer
  2 |      1 | screwdriver
  3 |      2 | nails
  4 |      2 | screws
  5 |      1 | wrench
  6 |      2 | bolts
  7 |      2 | cement
  8 |      3 | beer
  9 |      3 | burgers
 10 |      3 | french fries
(10 rows)
	Now (supposing you use Python) you use the extremely simple sample
program below:

import psycopg   # psycopg 1.x, Python 2 style code

db = psycopg.connect("host=localhost dbname=rencontres user=rencontres password=.")

# Simple. Let's make some cursors.
cursor = db.cursor()
cursor.execute( "BEGIN;" )
cursor.execute( "declare cat_cursor no scroll cursor without hold for select * from categories order by id for read only;" )
cursor.execute( "declare items_cursor no scroll cursor without hold for select * from items order by cat_id for read only;" )

# set up some generators
def qcursor( cursor, psql_cursor_name ):
    while True:
        cursor.execute( "fetch 2 from %s;" % psql_cursor_name )
        if not cursor.rowcount:
            break
        # print "%s fetched %d rows." % (psql_cursor_name, cursor.rowcount)
        for row in cursor.dictfetchall():
            yield row
    print "%s exhausted." % psql_cursor_name

# use the generators
categories = qcursor( cursor, "cat_cursor" )
items = qcursor( cursor, "items_cursor" )
current_item = items.next()
for cat in categories:
    print "Category : ", cat

    # if no items (or all items in category are done) skip to next category
    if cat['id'] < current_item['cat_id']:
        continue

    # case of items without category (should not happen)
    while cat['id'] > current_item['cat_id']:
        current_item = items.next()

    while current_item['cat_id'] == cat['id']:
        print "\t", current_item
        current_item = items.next()
It produces the following output:

Category :  {'id': 1, 'name': 'tools'}
        {'cat_id': 1, 'id': 1, 'name': 'hammer'}
        {'cat_id': 1, 'id': 2, 'name': 'screwdriver'}
        {'cat_id': 1, 'id': 5, 'name': 'wrench'}
Category :  {'id': 2, 'name': 'supplies'}
        {'cat_id': 2, 'id': 3, 'name': 'nails'}
        {'cat_id': 2, 'id': 4, 'name': 'screws'}
        {'cat_id': 2, 'id': 6, 'name': 'bolts'}
        {'cat_id': 2, 'id': 7, 'name': 'cement'}
Category :  {'id': 3, 'name': 'food'}
        {'cat_id': 3, 'id': 8, 'name': 'beer'}
        {'cat_id': 3, 'id': 9, 'name': 'burgers'}
        {'cat_id': 3, 'id': 10, 'name': 'french fries'}
This simple code, with "fetch 1000" instead of "fetch 2", dumps a database
of several million rows, where each category generally contains 1 but
often 2-4 items, at a speed of about 10,000 items/s.

Satisfied ?





Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Pierre-Frédéric Caillaud
My simple python program dumps 1653992 items in 1654000 categories in:
real    3m12.029s
user    1m36.720s
sys     0m2.220s
It was running on the same machine as postgresql (Athlon XP 2500).
I Ctrl-C'd it before it dumped the whole database, but you get the idea.
If you don't know Python and Generators, have a look !


Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Damien Dougan
Pierre-Frederic, Paul,

Thanks for your fast response (especially for the python code and
performance figure) - I'll chase this up as a solution - looks most
promising!

Cheers,

Damien




Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Pierre-Frédéric Caillaud
Thanks for the thanks !
	Generally, when grouping stuff together, it is a good idea to have two  
sorted lists, and to scan them simultaneously. I have already used this  
solution several times outside of Postgres, and it worked very well (it  
was with Berkeley DB and there were 3 lists to scan in order). The fact  
that Python can very easily virtualize these lists using generators makes  
it possible to do it without consuming too much memory.



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Bruce Momjian

Have you seen /src/tools/fsync?

---

[EMAIL PROTECTED] wrote:
> Hi, I'd like to help with the topic in the Subject: line.  It seems to be a
> TODO item.  I've reviewed some threads discussing the matter, so I hope I've
> acquired enough history concerning it.  I've taken an initial swipe at
> figuring out how to optimize sync'ing methods.  It's based largely on   
> recommendations I've read on previous threads about fsync/O_SYNC and so on.
> After reviewing, if anybody has recommendations on how to proceed then I'd   
> love to hear them.  
>
> Attached is a little program that basically does a bunch of sequential writes   
> to a file.  All of the sync'ing methods supported by PostgreSQL WAL can be   
> used.  Results are printed in microseconds.  Size and quantity of writes are   
> configurable.  The documentation is in the code (how to configure, build, run,   
> etc.).  I realize that this program doesn't reflect all of the possible   
> activities of a production database system, but I hope it's a step in the   
> right direction for this task.  I've used it to see differences in behavior   
> between the various sync'ing methods on various platforms.   
>
> Here's what I've found running the benchmark on some systems to which  
> I have access.  The differences in behavior between platforms are quite vast.   
>
> Summary first...   
>
>
> PostgreSQL should be run on an old Apple Macintosh attached to   
> its own Hitachi disk array with 2GB cache or so.  Use any sync method   
> except for fsync().   
>
>   
> Anyway, there is *a lot* of variance in file synching behavior across   
> different hardware and O/S platforms.  It's probably not safe   
> to conclude much.  That said, here are some findings so far based on   
> tests I've run:  
>   
> 1.  under no circumstances do fsync() or fdatasync() seem to perform   
> better than opening files with O_SYNC or O_DSYNC   
> 2.  where there are differences, opening files with O_SYNC or O_DSYNC   
> tends to be quite faster.  
> 3.  fsync() seems to be the slowest where there are differences.  And   
> O_DSYNC seems to be the fastest where results differ.   
> 4.  the safest thing to assert at this point is that   
> Solaris systems ought to use the O_DSYNC method for WAL.   
>   
> ---   
>
> Test system(s)   
>
> Athlon Linux:   
> AMD Athlon XP2000, 512MB RAM, single (5400 or 7200?) RPM 20GB IDE disk,   
> reiserfs filesystem (3 something I think)   
> SuSE Linux kernel 2.4.21-99   
>
> Mac Linux:   
> I don't know the specific model.  400MHz G3, 512MB, single IDE disk,   
> ext2 filesystem   
> Debian GNU/Linux 2.4.16-powerpc   
>
> HP Intel Linux:   
> HP ProLiant DL380 G3, 2 x 3GHz Xeon, 2GB RAM, SmartArray 5i 64MB cache,   
> 2 x 15,000RPM 36GB U320 SCSI drives mirrored.  I'm not sure if   
> writes are cached or not.  There's no battery backup.   
> ext3 filesystem.   
> Redhat Enterprise Linux 3.0 kernel based on 2.4.21   
>
> Dell Intel OpenBSD:   
> Poweredge ?, single 1GHz PIII, 128MB RAM, single 7200RPM 80GB IDE disk,   
> ffs filesystem   
> OpenBSD 3.2 GENERIC kernel   
>
> SUN Ultra2:   
> Ultra2, 2 x 296MHz UltraSPARC II, 2GB RAM, 2 x 10,000RPM 18GB U160   
> SCSI drives mirrored with Solstice DiskSuite.  UFS filesystem.   
> Solaris 8.   
>
> SUN E4500 + HDS Thunder 9570v   
> E4500, 8 x 400MHz UltraSPARC II, 3GB RAM,   
> HDS Thunder 9570v, 2GB mirrored battery-backed cache, RAID5 with a   
> bunch of 146GB 10,000RPM FC drives.  LUN is on single 2GB FC fabric   
> connection.   
> Veritas filesystem (VxFS)   
> Solaris 8.   
>
> Test methodology:   
>
> All test runs were done with CHUNKSIZE 8 * 1024, CHUNKS 2 * 1024,   
> FILESIZE_MULTIPLIER 2, and SLEEP 5.  So a total of 16MB was sequentially  
> written for each benchmark.  
>
> Results are in microseconds.   
>
> PLATFORM:   Athlon Linux   
> buffered:   48220   
> fsync:  74854397   
> fdatasync:  75061357   
> open_sync:  73869239   
> open_datasync:  74748145   
> Notes:  System mostly idle.  Even during tests, top showed about 95%   
> idle.  Something's not right on this box.  All sync methods similarly   
> horrible on this system.   
>
> PLATFORM:   Mac Linux   
> buffered:   58912   
> fsync:  1539079   
> fdatasync:  769058   
> open_sync:  767094   
> open_datasync:  763074   
> Notes: system mostly idle.  fsync seems worst.  Otherwise, they seem   
> pretty equivalent.  This is the fastest system tested.  
>
> PLATFORM:   HP Intel Linux   
> buffered:   33026   
> fsync:  29330067   
> fdatasync:  28673880   
> open_sync:  8783417   
> open_datasync:  8747971   
> Notes: system idle.  O_SYNC and O_DSYNC methods seem to be a lot   
> better on this platform than fsync & fdatasync.  
>   
> PLATFORM:   Dell Intel OpenBSD  
> buffer

Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread mudfoot
Quoting Bruce Momjian <[EMAIL PROTECTED]>:

> 
> Have you seen /src/tools/fsync?
> 

I have now.  Thanks.



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Gaetano Mendola
Bruce Momjian wrote:
Have you seen /src/tools/fsync?
Now that the argument is already open: why does postgres choose
fdatasync on linux? I understand from other posts that
on this platform open_sync is better than fdatasync.
However, I chose open_sync. Why not detect
this parameter during initdb?
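
For reference, the setting being discussed is wal_sync_method in
postgresql.conf; a minimal example (the value here is purely illustrative):

# postgresql.conf
wal_sync_method = open_sync   # other options: fsync, fdatasync, open_datasync (platform-dependent)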
Regards
Gaetano Mendola


These are my times:

kernel 2.4.9-e.24smp ( RAID SCSI ):
Simple write timing:
        write                0.011544
Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close  1.233312
        write, close, fsync  1.242086
Compare one o_sync write to two:
        one 16k o_sync write 0.517633
        two 8k o_sync writes 0.824603
Compare file sync methods with one 8k write:
(o_dsync unavailable)
        open o_sync, write   0.438580
        write, fdatasync     1.239377
        write, fsync         1.178017
Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
        open o_sync, write   0.818720
        write, fdatasync     1.395602
        write, fsync         1.351214

kernel 2.4.22-1.2199.nptlsmp (single EIDE disk):
Simple write timing:
        write                0.023697
Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close  0.688765
        write, close, fsync  0.702166
Compare one o_sync write to two:
        one 16k o_sync write 0.498296
        two 8k o_sync writes 0.543956
Compare file sync methods with one 8k write:
(o_dsync unavailable)
        open o_sync, write   0.259664
        write, fdatasync     0.971712
        write, fsync         1.006096
Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
        open o_sync, write   0.536882
        write, fdatasync     1.160347
        write, fsync         1.189699





Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Josh Berkus
Gaetano,

> Now that the argument is already open: why does postgres choose
> fdatasync on linux? I understand from other posts that
> on this platform open_sync is better than fdatasync.

Not necessarily.   For example, here are my test results, on Linux 2.6.7, 
writing to a ReiserFS mount on a Software RAID 1 slave of 2 IDE disks, on an 
Athlon 1600MHz single-processor machine.   I ran the loop 10,000 times 
instead of 1,000 because tests with 1,000 varied too much.

Simple write timing:
        write                 0.088701

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close   3.593958
        write, close, fsync   3.556978

Compare one o_sync write to two:
        one 16k o_sync write 42.951595
        two 8k o_sync writes 11.251389

Compare file sync methods with one 8k write:
(o_dsync unavailable)
        open o_sync, write    6.807060
        write, fdatasync      7.207879
        write, fsync          7.209087

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
        open o_sync, write   13.120305
        write, fdatasync      7.583871
        write, fsync          7.801748

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Tom Lane
Gaetano Mendola <[EMAIL PROTECTED]> writes:
> Now that the argument is already open: why does postgres choose
> fdatasync on linux? I understand from other posts that
> on this platform open_sync is better than fdatasync.

AFAIR, we've seen *one* test from *one* person alleging that.
And it was definitely not that way when we tested the behavior
originally, several releases back.  I'd like to see more evidence,
or better some indication that the Linux kernel changed algorithms,
before changing the default.

The tests that started this thread are pretty unconvincing in my eyes,
because they are comparing open_sync against code that fsyncs after each
one-block write.  Under those circumstances, *of course* fsync will lose
(or at least do no better), because it's forcing the same number of
writes through a same-or-less-efficient API.  The reason that this isn't
a trivial choice is that Postgres doesn't necessarily need to fsync
after every block of WAL.  In particular, when doing large transactions
there could be many blocks written between fsyncs, and in that case you
could come out ahead with fsync because the kernel would have more
freedom to schedule disk writes.

So, the only test I put a whole lot of faith in is testing your own
workload on your own Postgres server.  But if we want to set up a toy
test program to test this stuff, it's at least got to have an easily
adjustable (and preferably randomizable) distance between fsyncs.

Also, tests on IDE drives have zero credibility to start with, unless
you can convince me you know how to turn off write buffering on the
drive...

regards, tom lane



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Simon Riggs

Mark,

I thought some additional comments on top of Christopher's excellent notes
might help you.

> Christopher Browne
> The world rejoiced as Mischa Sandberg
> <[EMAIL PROTECTED]> wrote:
> > Mark Cotner wrote:
> >> Requirements:
> >> Merge table definition equivalent.  We use these
> >> extensively.
>
> > Looked all over mysql.com etc, and afaics merge table is indeed
> > exactly a view of a union-all. Is that right?
>

PostgreSQL's functionality is in many ways similar to Oracle Partitioning.

Loading up your data in many similar tables, then creating a view like:

CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...

will allow the PostgreSQL optimizer to eliminate partitions from the query
when you run queries which include a predicate on the partitioning_col, e.g.

select count(*) from bigtable where idate >= 200409131000

will scan the last two partitions only...

There are a few other ways of creating the view that return the same answer,
but only using constants in that way will allow the partitions to be
eliminated from the query; otherwise it will run for much longer.

So you can give different VIEWS to different user groups, have different
indexes on different tables etc.

However, I haven't managed to get this technique to work when performing a
star join to a TIME dimension table, since the partition elimination relies
on comparison of constant expressions. You'll need to check out each main
join type to make sure it works for you in your environment.

> > PG supports views, of course, as well (now) as tablespaces, allowing
> > you to split tables/tablesets across multiple disk systems.  PG is
> > also pretty efficient in query plans on such views, where (say) you
> > make one column a constant (identifier, sort of) per input table.
>
> The thing that _doesn't_ work well with these sorts of UNION views are
> when you do self-joins.  Supposing you have 10 members, a self-join
> leads to a 100-way join, which is not particularly pretty.
>

Well, that only happens when you forget to include the partitioning constant
in the self join.

e.g. select count(*) from bigtable a, bigtable b where a.idate =
b.idate;  --works just fine

The optimizer really is smart enough to handle that too, but I'm sure such
large self-joins aren't common for you anyhow.

> I'm quite curious as to how MySQL(tm) copes with this, although it may
> not be able to take place; they may not support that...
>

It doesn't, AFAIK.

> Christopher Browne wrote
> A long time ago, in a galaxy far, far away, [EMAIL PROTECTED]
> (Mark Cotner) wrote:
> > Agreed, I did some preliminary testing today and am very impressed.
> > I wasn't used to running analyze after a data load, but once I did
> > that everything was snappy.
>
> Something worth observing is that this is true for _any_ of the
> database systems supporting a "cost-based" optimization system,
> including Oracle and DB2.

Agreed. You can reduce the time for the ANALYZE by ignoring some of the
(measures) columns not used in WHERE clauses.

Also, if you're sure that each load is very similar to the last, you might
even consider directly updating pg_statistic rows with the statistical
values produced from an earlier ANALYZE...scary, but it can work.

To create a set of tables of > 600Gb, you will benefit from creating each
table WITHOUT OIDS.
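
As a minimal sketch of those last two tips (the column names and types
here are hypothetical; the table name follows the view example above):

CREATE TABLE table200409130800 (
    col1  integer,
    col2  integer,
    col3  text
) WITHOUT OIDS;

ANALYZE table200409130800 (col1);   -- gather statistics only for the columns used in WHERE clauses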

Hope some of that helps you...

Best Regards, Simon Riggs




[PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Stephen Crowley
Does postgres cache the entire result set before it begins returning
data to the client?

I have a table with ~8 million rows and I am executing a query which
should return about ~800,000 rows. The problem is that as soon as I
execute the query it absolutely kills my machine and begins swapping
for 5 or 6 minutes before it begins returning results. Is postgres
trying to load the whole query into memory before returning anything?
Also, why would it choose not to use the index? It is properly
estimating the # of rows returned. If I set enable_seqscan to off it
is just as slow.

Running postgres 8.0 beta2 dev2

explain select * from island_history where date='2004-09-07' and stock='QQQ';
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on island_history  (cost=0.00..266711.23 rows=896150 width=83)
   Filter: ((date = '2004-09-07'::date) AND ((stock)::text = 'QQQ'::text))
(2 rows)

Any help would be appreciated

--Stephen

 Table "public.island_history"
  Column  |  Type  | Modifiers
--++---
 date | date   | not null
 stock| character varying(6)   |
 time | time without time zone | not null
 reference_number | numeric(9,0)   | not null
 message_type | character(1)   | not null
 buy_sell_ind | character(1)   |
 shares   | numeric(6,0)   |
 remaining_shares | numeric(6,0)   |
 price| numeric(10,4)  |
 display  | character(1)   |
 match_number | numeric(9,0)   | not null
Indexes:
"island_history_pkey" PRIMARY KEY, btree (date, reference_number,
message_type, "time", match_number)
"island_history_date_stock_time" btree (date, stock, "time")
"island_history_oid" btree (oid)



Re: [PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Tom Lane
Stephen Crowley <[EMAIL PROTECTED]> writes:
> Does postgres cache the entire result set before it begins returning
> data to the client?

The backend doesn't, but libpq does, and I think JDBC does too.

I'd recommend using a cursor so you can FETCH a reasonable number of
rows at a time.
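
A minimal sketch of that, using the table from your post (the fetch
size is arbitrary):

BEGIN;
DECLARE hist_cur NO SCROLL CURSOR FOR
    SELECT * FROM island_history
    WHERE date = '2004-09-07' AND stock = 'QQQ';
FETCH FORWARD 10000 FROM hist_cur;   -- repeat until no rows are returned
CLOSE hist_cur;
COMMIT;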

> Also, why would it choose not to use the index?

Selecting 1/10th of a table is almost always a poor candidate for an
index scan.  You've got about 100 rows per page (assuming the planner's
width estimate is credible) and so on average every page of the table
has about ten rows that need to be picked up and returned.  You might as
well just seqscan and be sure you don't read any page more than once.

regards, tom lane



Re: [PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Rod Taylor
On Mon, 2004-09-13 at 20:51, Stephen Crowley wrote:
> Does postgres cache the entire result set before it begins returning
> data to the client?

Sometimes you need to be careful as to how the clients treat the data. 

For example, psql will resize column widths based on the length (width)
of the data returned.

PHP and Perl will retrieve and cache all of the rows if you request a
row count ($sth->rows() or pg_num_rows($rset)).


You may find that using a cursor will help you out.

> I have a table with ~8 million rows and I am executing a query which
> should return about ~800,000 rows. The problem is that as soon as I
> execute the query it absolutely kills my machine and begins swapping
> for 5 or 6 minutes before it begins returning results. Is postgres
> trying to load the whole query into memory before returning anything?
> Also, why would it choose not to use the index? It is properly
> estimating the # of rows returned. If I set enable_seqscan to off it
> is just as slow.
> 
> Running postgres 8.0 beta2 dev2
> 
> explain select * from island_history where date='2004-09-07' and stock='QQQ';
> QUERY PLAN
> ---
>  Seq Scan on island_history  (cost=0.00..266711.23 rows=896150 width=83)
>Filter: ((date = '2004-09-07'::date) AND ((stock)::text = 'QQQ'::text))
> (2 rows)
> 
> Any help would be appreciated
> 
> --Stephen
> 
>  Table "public.island_history"
>   Column  |  Type  | Modifiers
> --++---
>  date | date   | not null
>  stock| character varying(6)   |
>  time | time without time zone | not null
>  reference_number | numeric(9,0)   | not null
>  message_type | character(1)   | not null
>  buy_sell_ind | character(1)   |
>  shares   | numeric(6,0)   |
>  remaining_shares | numeric(6,0)   |
>  price| numeric(10,4)  |
>  display  | character(1)   |
>  match_number | numeric(9,0)   | not null
> Indexes:
> "island_history_pkey" PRIMARY KEY, btree (date, reference_number,
> message_type, "time", match_number)
> "island_history_date_stock_time" btree (date, stock, "time")
> "island_history_oid" btree (oid)
> 
-- 
Rod Taylor 

Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
PGP Key: http://www.rbt.ca/signature.asc




Re: [PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Stephen Crowley
On Mon, 13 Sep 2004 21:11:07 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
> Stephen Crowley <[EMAIL PROTECTED]> writes:
> > Does postgres cache the entire result set before it begins returning
> > data to the client?
> 
> The backend doesn't, but libpq does, and I think JDBC does too.
> 
> I'd recommend using a cursor so you can FETCH a reasonable number of
> rows at a time.

That is incredible. Why would libpq do such a thing? JDBC as well? I
know oracle doesn't do anything like that, not sure about mysql. Is
there any way to turn it off? In this case I was just using psql but
will be using JDBC for the app.  About cursors, I thought a jdbc
ResultSet WAS a cursor, am I mistaken?

Thanks,
Stephen



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Bruce Momjian
Tom Lane wrote:
> The tests that started this thread are pretty unconvincing in my eyes,
> because they are comparing open_sync against code that fsyncs after each
> one-block write.  Under those circumstances, *of course* fsync will lose
> (or at least do no better), because it's forcing the same number of
> writes through a same-or-less-efficient API.  The reason that this isn't
> a trivial choice is that Postgres doesn't necessarily need to fsync
> after every block of WAL.  In particular, when doing large transactions
> there could be many blocks written between fsyncs, and in that case you
> could come out ahead with fsync because the kernel would have more
> freedom to schedule disk writes.

My guess is that the majority of queries do not fill more than one WAL
block.  Sure some do, but in those cases the fsync is probably small
compared to the duration of the query.  If we had a majority of queries
filling more than one block we would be checkpointing like crazy  and we
don't normally get reports about that.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Gaetano Mendola
Tom Lane wrote:
| Gaetano Mendola <[EMAIL PROTECTED]> writes:
|
|>Now that the argument is already open: why does postgres choose
|>fdatasync on linux? I understand from other posts that
|>on this platform open_sync is better than fdatasync.
|
|
| AFAIR, we've seen *one* test from *one* person alleging that.
| And it was definitely not that way when we tested the behavior
| originally, several releases back.  I'd like to see more evidence,
| or better some indication that the Linux kernel changed algorithms,
| before changing the default.
I remember more than one person claiming that open_sync *apparently*
was working better than fdatasync; however, I trust you (it's
3:00 AM here).
| The tests that started this thread are pretty unconvincing in my eyes,
| because they are comparing open_sync against code that fsyncs after each
| one-block write.  Under those circumstances, *of course* fsync will lose
| (or at least do no better), because it's forcing the same number of
| writes through a same-or-less-efficient API.
|
| The reason that this isn't a trivial choice is that Postgres doesn't
| necessarily need to fsync after every block of WAL.  In particular,
| when doing large transactions there could be many blocks written between
| fsyncs, and in that case you could come out ahead with fsync because the
| kernel would have more freedom to schedule disk writes.
Are you suggesting that postgres should use more than one sync method, and
use one or the other depending on the activity it is performing?
| So, the only test I put a whole lot of faith in is testing your own
| workload on your own Postgres server.  But if we want to set up a toy
| test program to test this stuff, it's at least got to have an easily
| adjustable (and preferably randomizable) distance between fsyncs.
|
| Also, tests on IDE drives have zero credibility to start with, unless
| you can convince me you know how to turn off write buffering on the
| drive...
I reported the IDE times just for info; however, my SAN works better
with open_sync. Can we trust the numbers given by tools/fsync?  I have seen
some of your objections in the past, but I don't know if there has been a
fix since then.
Regards
Gaetano Mendola






Re: [PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Tom Lane
Stephen Crowley <[EMAIL PROTECTED]> writes:
> On Mon, 13 Sep 2004 21:11:07 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
>> Stephen Crowley <[EMAIL PROTECTED]> writes:
>>> Does postgres cache the entire result set before it begins returning
>>> data to the client?
>> 
>> The backend doesn't, but libpq does, and I think JDBC does too.

> That is incredible. Why would libpq do such a thing?

Because the API it presents doesn't allow for the possibility of query
failure after having given you back a PGresult: either you have the
whole result available with no further worries, or you don't.
If you think it's "incredible", let's see you design an equally
easy-to-use API that doesn't make this assumption.

(Now having said that, I would have no objection to someone extending
libpq to offer an alternative streaming API for query results.  It
hasn't got to the top of anyone's to-do list though ... and I'm
unconvinced that psql could use it if it did exist.)

regards, tom lane



Re: [PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-13 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> If we had a majority of queries filling more than one block we would
> be checkpointing like crazy and we don't normally get reports about
> that.

[ raised eyebrow... ]  And of course the 30-second-checkpoint-warning
stuff is a useless feature that no one ever exercises.

But your logic doesn't hold up anyway.  People may be doing large
transactions without necessarily doing them back-to-back-to-back;
there could be idle time in between.  For instance, I'd think an average
transaction size of 100 blocks would be more than enough to make fsync a
winner.  There are 2K blocks per WAL segment, so 20 of these would fit
in a segment.  With the default WAL parameters you could do sixty such
transactions per five minutes, or one every five seconds, without even
causing more-frequent-than-default checkpoints; and you could do two a
second without setting off the checkpoint-warning alarm.  The lack of
checkpoint complaints doesn't prove that this isn't a common real-world
load.

regards, tom lane



[PERFORM] tblspaces integrated in new postgresql (version 8.0)

2004-09-13 Thread bill
Hi, I have downloaded the new postgresql (version 8.0 beta2) and I was
wondering what performance features I can take advantage of before I start
to dump my 3/4 terabyte database into the new format.  More specifically,
I am interested in tablespaces--what exactly is this feature, some sort of
organizational addition (?), and how can I best take advantage of this?
Anything else?

Thanks.



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mischa Sandberg
Mark Cotner wrote:
Hi all,
I had a difficult time deciding which list to post
this to, so please forgive me if this list doesn't
perfectly match my questions.  My decision will not
solely be based on performance, but it is the primary
concern.  I would be very appreciative if you all
could comment on my test plan.  Real world examples of
a Postgres implementation of >=600G with a web
front-end would be great, or any data warehouse with
some size to it.
I'm only in the 30GB range of database, in case that's a consideration 
for my comments that follow.

At this time, I'm working out the best ROLAP options for our PG 
transaction store. The transaction store is highly volatile -- longest a 
txn stays in it is 15 days ... so you imagine the need for historic 
summaries :-)

I've also implemented multiple data servers, including
a federated server that had to make the best of existing engines
(like MySQL, PG and everything from MSJet to RedBrick in the commercial 
world).

The time has come to reevaluate/rearchitect an
application which I built about 3 years ago.  There
are no performance concerns with MySQL, but it would
benefit greatly from stored procedures, views, etc. 
If your company is currently happy with MySQL, there probably are other 
(nontechnical) reasons to stick with it. I'm impressed that you'd 
consider reconsidering PG.

Some of the mining that we do could benefit from
stored procedures as well.  MySQL may have these in
the works, but we won't be able to move to a version
of MySQL that supports stored procs for another year
or two.
And PG lets you back-end with some powerful pattern- and 
aggregate-handling languages, like Perl. This was definitely a plus for 
data mining of web traffic, for example. The power of server-side 
extensibility for bailing you out of a design dead-end is not 
inconsequential.

PG doesn't have PIVOT operators (qv Oracle and MSSQL), but it makes the 
translation from data to column fairly painless otherwise.

Requirements:
Merge table definition equivalent.  We use these
extensively.
Looked all over mysql.com etc, and afaics merge table
is indeed exactly a view of a union-all. Is that right?
PG supports views, of course, as well (now) as tablespaces, allowing you 
to split tables/tablesets across multiple disk systems.
PG is also pretty efficient in query plans on such views, where (say) 
you make one column a constant (identifier, sort of) per input table.

Merge table equivalent with all tables containing over
100M rows (and about 40 columns, some quite wide) will
need to do index scans in at least 5 seconds (MySQL
currently does 2, but we can live with 5) and return
~200 rows.
PG has TOAST for handling REALLY BIG columns, and the generic TEXT type 
is as efficient as any size-specific VARCHAR() type ... should make 
things easier for you.

Um, gonna sound silly, but the web interface has to
remain "snappy" under load.  I don't see this as a
major concern since you don't require table locking.
Agreed. It's more in your warehouse design, and intelligent bounding of 
queries. I'd say PG's query analyzer is a few years ahead of MySQL for 
large and complex queries.

If business logic is moved to the database (likely with
Postgres), performance for inserting with light logic
on each insert has to keep up with the 4.5M inserts
per 2 hours (which MySQL completes in ~35min
currently).  Acceptable numbers for this aggregation
would be 45-55min using stored procedures.
Again, it's a matter of pipeline design. The tools for creating an 
efficient pipeline are at least as good in PG as MySQL.

If you try to insert and postprocess information one row at a time,
procedures or no, there's no offhand way to guarantee your performance 
without a test/prototype.

On the other hand, if you do warehouse-style loading (Insert, or PG 
COPY, into a temp table; and then 'upsert' into the perm table), I can 
guarantee 2500 inserts/sec is no problem.

Here's our case study if you're interested . . . 
http://www.mysql.com/customers/customer.php?id=16


[PERFORM] Four table join with million records - performance improvement?

2004-09-13 Thread Vijay Moses
Hi, I have four sample tables: ename, esal, edoj and esum.
All of them have 100 records. I'm running the following
query: select ename.eid, name, sal, doj, summary from
ename, esal, edoj, esum where ename.eid=esal.eid and ename.eid=edoj.eid
and ename.eid=esum.eid. It's a join of all four tables which returns
all 1 million records. The eid field in ename is a Primary Key and the
eid in all other tables are Foreign Keys. I have created an index for
all Foreign Keys. This query takes around 16 MINUTES to complete. Can
this time be reduced?
Thanks
Vijay



EXPLAIN OUTPUT

QUERY PLAN 
Merge Join  (cost=647497.97..163152572.97 rows=252500 width=80) 
  Merge Cond: ("outer".eid = "inner".eid) 
  ->  Merge Join  (cost=356059.69..75361059.69 rows=50 width=44) 
Merge Cond: ("outer".eid = "inner".eid) 
->  Sort  (cost=150295.84..152795.84 rows=100 width=8) 
  Sort Key: edoj.eid 
  ->  Seq Scan on edoj  (cost=0.00..15568.00 rows=100 width=8) 
->  Sort  (cost=205763.84..208263.84 rows=100 width=36) 
  Sort Key: esum.eid 
  ->  Seq Scan on esum  (cost=0.00..31976.00 rows=100 width=36) 
  ->  Sort  (cost=291438.28..293938.29 rows=102 width=48) 
Sort Key: ename.eid 
->  Hash Join  (cost=26683.01..107880.23 rows=102 width=48) 
  Hash Cond: ("outer".eid = "inner".eid) 
  ->  Seq Scan on esal  (cost=0.00..21613.01 rows=101 width=12) 
  ->  Hash  (cost=16370.01..16370.01 rows=101 width=36) 
->  Seq Scan on ename  (cost=0.00..16370.01
rows=101 width=36)

17 row(s)

Total runtime: 181.021 ms



EXPLAIN ANALYZE OUTPUT

QUERY PLAN 

Merge Join  (cost=647497.97..163152572.97 rows=252500
width=80) (actual time=505418.965..584981.013 rows=100 loops=1)
  Merge Cond: ("outer".eid = "inner".eid) 
  ->  Merge Join  (cost=356059.69..75361059.69 rows=50
width=44) (actual time=110394.376..138177.569 rows=100 loops=1)
Merge Cond: ("outer".eid = "inner".eid) 
->  Sort  (cost=150295.84..152795.84 rows=100 width=8)
(actual time=27587.622..31077.077 rows=100 loops=1)
  Sort Key: edoj.eid 
  ->  Seq Scan on edoj  (cost=0.00..15568.00 rows=100
width=8) (actual time=144.000..10445.145 rows=100 loops=1)
->  Sort  (cost=205763.84..208263.84 rows=100 width=36)
(actual time=82806.646..90322.943 rows=100 loops=1)
  Sort Key: esum.eid 
  ->  Seq Scan on esum  (cost=0.00..31976.00 rows=100
width=36) (actual time=20.312..29030.247 rows=100 loops=1)
  ->  Sort  (cost=291438.28..293938.29 rows=102 width=48) (actual
time=395024.482..426870.491 rows=101 loops=1)
Sort Key: ename.eid 
->  Hash Join  (cost=26683.01..107880.23 rows=102
width=48) (actual time=29234.472..198064.105 rows=101 loops=1)
  Hash Cond: ("outer".eid = "inner".eid) 
  ->  Seq Scan on esal  (cost=0.00..21613.01 rows=101
width=12) (actual time=32.257..23999.163 rows=101 loops=1)
  ->  Hash  (cost=16370.01..16370.01 rows=101
width=36) (actual time=19362.095..19362.095 rows=0 loops=1)
->  Seq Scan on ename  (cost=0.00..16370.01
rows=101 width=36) (actual time=26.744..13878.410 rows=101
loops=1)

Total runtime: 586226.831 ms 

18 row(s)

Total runtime: 586,435.978 ms





[PERFORM] tblspace

2004-09-13 Thread Bill Fefferman
Hi, I have downloaded the new postgresql (version 8.0 beta2) and I was
wondering what performance features I can take advantage of before I start
to dump my 3/4 terrabyte database into the new database.  More 
specifically
I am interested in tablespaces--what exactly is this feature, some sort of
organizational addition (?) and how can I best take advantage of this? 
Anything else?  Furthermore, if I compile from source will I be able to 
revert to using the packaged version of postgresql 8.0 stable later on 
without modifying the database(I use debian)….?

Thanks.




Re: [PERFORM] Help with extracting large volumes of records across related tables

2004-09-13 Thread Mischa Sandberg
Damien Dougan wrote:
Basically we have a number of tables, which are exposed as 2 public
views (say PvA and PvB). For each row in PvA, there are a number of
related rows in PvB (this number is arbitrary, which is one of the
reasons why it cant be expressed as additional columns in PvA - so we
really need 2 sets of tables - which leads to two sets of extract calls
- interwoven to associate PvA with PvB).
Are there any tools/tricks/tips with regards to extracting large volumes
of data across related tables from Postgres? It doesnt have to export
into XML, we can do post-processing on the extracted data as needed -
the important thing is to keep the relationship between PvA and PvB on a
row-by-row basis.
Just recently had to come up with an alternative to MSSQL's "SQL..FOR 
XML", for some five-level nested docs, that turned out to be faster (!)
and easier to understand:

Use SQL to organize each of the row types into a single text field, plus 
a single key field, as well as any filter fields you need. Sort the union, 
and have the reading process break them into documents.

For example, if PvA has key (account_id, order_id) and
fields (order_date, ship_date), and PvB has key (order_id, product_id) and
fields (order_qty, back_order):

CREATE VIEW PvABxml AS
SELECT  account_id::text || order_id::text AS quay
       ,'order_date="' || order_date::text
         || '" ship_date="' || ship_date::text || '"' AS info
       ,ship_date
FROM    PvA
UNION ALL
SELECT  account_id::text || order_id::text || product_id::text
       ,'order_qty="' || order_qty::text || '"'
       ,ship_date
FROM    PvA JOIN PvB USING (order_id);
Then:
SELECT quay, info
FROM pvABxml
WHERE ship_date = '...'
ORDER BY quay
gives you a stream of info in the (parent,child,child... 
parent,child,child...) order you want, which assembles very easily into 
XML documents. If you need to pick out, say, orders where there are 
backordered items, you probably need to work with a temp table with 
which to prefilter.



Re: [PERFORM] tblspaces integrated in new postgresql (version 8.0)

2004-09-13 Thread Tom Lane
[EMAIL PROTECTED] writes:
> I am interested in tablespaces--what exactly is this feature, some sort of
> organizational addition (?) and howcan I best take advantage of this? 

See
http://developer.postgresql.org/docs/postgres/manage-ag-tablespaces.html

It doesn't talk a lot yet about *why* you'd want to use this ...
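
In outline, the commands look like this (the path and the names are
hypothetical):

CREATE TABLESPACE fastdisk LOCATION '/mnt/raid10/pg_tblspc';
CREATE TABLE big_fact_table (id bigint, payload text) TABLESPACE fastdisk;
ALTER TABLE existing_table SET TABLESPACE fastdisk;   -- move an existing table there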

regards, tom lane



Re: [PERFORM] Four table join with million records - performance improvement?

2004-09-13 Thread Tom Lane
Vijay Moses <[EMAIL PROTECTED]> writes:
> Hi i have four sample tables ename, esal, edoj and esum
> All of them have 100 records. Im running the following
> query : select ename.eid, name, sal, doj, summary from
> ename,esal,edoj,esum where ename.eid=esal.eid and ename.eid=edoj.eid
> and ename.eid=esum.eid. Its a join of all four tables which returns
> all 1 million records. The eid field in ename is a Primary Key and the
> eid in all other tables are Foreign Keys. I have created an index for
> all Foreign Keys. This query takes around 16 MINUTES to complete. Can
> this time be reduced?

The indexes will be completely useless for that sort of query; the
reasonable choices are sort/merge or hashjoin.  For either one, your
best way to speed it up is to increase sort_mem.
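
For example (the value is illustrative; sort_mem is specified in KB,
and is called work_mem in later releases):

SET sort_mem = 65536;   -- 64MB per sort/hash operation for this session
EXPLAIN ANALYZE
SELECT ename.eid, name, sal, doj, summary
FROM ename, esal, edoj, esum
WHERE ename.eid = esal.eid AND ename.eid = edoj.eid AND ename.eid = esum.eid;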

regards, tom lane



Re: [PERFORM] Large # of rows in query extremely slow, not using index

2004-09-13 Thread Stephen Crowley
Problem solved. I set the fetchSize to a reasonable value instead of
the default of unlimited in the PreparedStatement and now the query
is fast. After some searching it seems this is a common problem; would it
make sense to change the default value to something other than 0 in
the JDBC driver?

If I get some extra time I'll look into libpq and see what is required
to fix the API. Most thirdparty programs and existing JDBC apps won't
work with the current paradigm when returning large result sets.

Thanks,
Stephen



On Mon, 13 Sep 2004 21:49:14 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
> Stephen Crowley <[EMAIL PROTECTED]> writes:
> > On Mon, 13 Sep 2004 21:11:07 -0400, Tom Lane <[EMAIL PROTECTED]> wrote:
> >> Stephen Crowley <[EMAIL PROTECTED]> writes:
> >>> Does postgres cache the entire result set before it begins returning
> >>> data to the client?
> >>
> >> The backend doesn't, but libpq does, and I think JDBC does too.
> 
> > That is incredible. Why would libpq do such a thing?
> 
> Because the API it presents doesn't allow for the possibility of query
> failure after having given you back a PGresult: either you have the
> whole result available with no further worries, or you don't.
> If you think it's "incredible", let's see you design an equally
> easy-to-use API that doesn't make this assumption.
> 
> (Now having said that, I would have no objection to someone extending
> libpq to offer an alternative streaming API for query results.  It
> hasn't got to the top of anyone's to-do list though ... and I'm
> unconvinced that psql could use it if it did exist.)
