[PERFORM] Fragmentation of WAL files

2007-04-26 Thread Jim Nasby
I was recently running defrag on my windows/parallels VM and noticed  
a bunch of WAL files that defrag couldn't take care of, presumably  
because the database was running. What's disturbing to me is that  
these files all had ~2000 fragments. Now, this was an EnterpriseDB  
database which means the WAL files were 64MB instead of 16MB, but  
even having 500 fragments for a 16MB WAL file seems like it would  
definitely impact performance.


Can anyone else confirm this? I don't know if this is a windows-only  
issue, but I don't know of a way to check fragmentation in unix.

--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] seeking advise on char vs text or varchar in search table

2007-04-26 Thread Jim Nasby

On Apr 23, 2007, at 7:16 AM, Merlin Moncure wrote:

On 4/20/07, chrisj <[EMAIL PROTECTED]> wrote:


I have a table that contains a column for keywords that I expect to become
quite large and will be used for web searches.  I will either index the
column or come up with a simple hashing algorithm, add the hash key to the
table, and index that column.

I am thinking the max length in the keyword column I need to support is 30,
but the average would be less than 10.

Any suggestions on whether to use char(30), varchar(30) or text would be
appreciated.  I am looking for the best performance option, not necessarily
the most economical on disk.


Don't use char... it always pads the string out to the declared length.  It
also has no real advantage over varchar in any practical situation.
Think of varchar as text with a maximum length... it's no faster or
slower, but the database will reject entries based on length (which
can be a good or a bad thing)... in this case, text feels better.


AIUI, char, varchar and text all store their data in *exactly* the  
same way in the database; char only pads data on output, and in the  
actual tables it still contains the regular varlena header. The only  
reason I've ever used char in other databases is to save the overhead  
of the variable-length information, so I recommend to people to just  
steer clear of char in PostgreSQL.

--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] What`s wrong with JFS configuration?

2007-04-26 Thread Jim Nasby

On Apr 25, 2007, at 8:51 AM, Paweł Gruszczyński wrote:
where u6 stores Fedora Core 6 operating system, and u0 stores 3  
partitions with ext2, ext3 and jfs filesystem.


Keep in mind that drives have a faster data transfer rate at the  
outer-edge than they do at the inner edge, so if you've got all 3  
filesystems sitting on that array at the same time it's not a fair  
test. I heard numbers on the impact of this a *long* time ago and I  
think it was in the 10% range, but I could be remembering wrong.


You'll need to drop each filesystem and create the next one to get a  
fair comparison.

--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] What`s wrong with JFS configuration?

2007-04-26 Thread Luke Lonergan
The outer-track / inner-track performance difference is more like 40 percent.
A recent example is 78MB/s on the outer tracks and 44MB/s on the inner tracks for the new Seagate 750GB drive
(see http://www.storagereview.com for benchmark results)

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Jim Nasby [mailto:[EMAIL PROTECTED]
Sent:   Thursday, April 26, 2007 03:53 AM Eastern Standard Time
To: Pawel Gruszczynski
Cc: pgsql-performance@postgresql.org
Subject:Re: [PERFORM] What`s wrong with JFS configuration?

On Apr 25, 2007, at 8:51 AM, Pawel Gruszczynski wrote:
> where u6 stores Fedora Core 6 operating system, and u0 stores 3  
> partitions with ext2, ext3 and jfs filesystem.

Keep in mind that drives have a faster data transfer rate at the  
outer-edge than they do at the inner edge, so if you've got all 3  
filesystems sitting on that array at the same time it's not a fair  
test. I heard numbers on the impact of this a *long* time ago and I  
think it was in the 10% range, but I could be remembering wrong.

You'll need to drop each filesystem and create the next one to get a  
fair comparison.
--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)






Re: [PERFORM] seeking advise on char vs text or varchar in search table

2007-04-26 Thread Gregory Stark
"Jim Nasby" <[EMAIL PROTECTED]> writes:

> AIUI, char, varchar and text all store their data in *exactly* the same way in
> the database; char only pads data on output, and in the actual tables it still
> contains the regular varlena header. The only reason I've ever used char in
> other databases is to save the overhead of the variable-length information, so
> I recommend to people to just steer clear of char in PostgreSQL.

Everything you said is correct except that char actually pads its data on
input, not output. This doesn't actually make a lot of sense since we're
storing it as a varlena so we could pad it on output and modify the data type
functions to pretend the spaces are there without storing them.

However it would only make a difference if you're storing variable-length data
in a char field, in which case I would 100% agree with your conclusion and
strongly recommend using varchar. The only reason I would think of using char
is when the data should always be the same length, like an SSN or an MD5 hash or
something like that. In that case it's purely for the self-documenting
notational convenience, not any performance reason.
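
A quick way to see the input padding for yourself (this assumes a release
that has pg_column_size(), which I believe is 8.1 and later):

SELECT pg_column_size('abc'::char(30))    AS char_bytes,
       pg_column_size('abc'::varchar(30)) AS varchar_bytes,
       pg_column_size('abc'::text)        AS text_bytes;
-- the char(30) value comes back noticeably larger than the other two,
-- because the padding spaces really are stored with the datum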

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PERFORM] What`s wrong with JFS configuration?

2007-04-26 Thread Cosimo Streppone

Jim Nasby wrote:


On Apr 25, 2007, at 8:51 AM, Paweł Gruszczyński wrote:
where u6 stores Fedora Core 6 operating system, and u0 stores 3 
partitions with ext2, ext3 and jfs filesystem.


Keep in mind that drives have a faster data transfer rate at the 
outer-edge than they do at the inner edge [...]


I've been wondering from time to time whether partition position on the
disk can be a (probably modest, of course) performance factor.

If I create a partition at the beginning or at the end of the disk, does
that correspond to a predictable physical position on the platters?

I remember having heard that every manufacturer has its own
allocation logic.

Does anyone have more information, just out of curiosity?

--
Cosimo




[PERFORM] Fw: PostgreSQL Performance Tuning

2007-04-26 Thread Shohab Abdullah
Dear,
We are facing a performance tuning problem while using a PostgreSQL database 
over the network on a Linux OS.
Our database consists of more than 500 tables with an average of 10K 
records per table and an average of 20 users accessing the database 
simultaneously over the network. Each table has indexes and we are 
querying the database using Hibernate.
The biggest problem is with insertion, updating and fetching of records, 
i.e. the database performance is very slow. It takes a long time to respond 
in the above scenario.
Please advise me on tuning the database. I am attaching my 
postgresql.conf file for reference on our current configuration.



Please reply ASAP.
Regards,
   Shohab Abdullah 
   Software Engineer,
Manufacturing SBU-POWAI
Larsen and Toubro Infotech Ltd.| 4th floor, L&T Technology Centre, 
Saki Vihar Road, Powai, Mumbai-400072
  Phone: +91-22-67767366 | +91-9870247322
Visit us at : http://www.lntinfotech.com 
"I cannot predict future, I cannot change past, I have just the present 
moment, I must treat it as my last"





postgresql.conf
Description: Binary data



Re: [PERFORM] Fw: PostgreSQL Performance Tuning

2007-04-26 Thread Richard Huxton
Please try to keep postings to one mailing list - I've replied to the 
performance list here.


Shohab Abdullah wrote:

Dear,
We are facing a performance tuning problem while using a PostgreSQL database 
over the network on a Linux OS.
Our database consists of more than 500 tables with an average of 10K 
records per table and an average of 20 users accessing the database 
simultaneously over the network. Each table has indexes and we are 
querying the database using Hibernate.
The biggest problem is with insertion, updating and fetching of records, 
i.e. the database performance is very slow. It takes a long time to respond 
in the above scenario.
Please advise me on tuning the database. I am attaching my 
postgresql.conf file for reference on our current configuration.


You haven't provided any details on what version of PG you are using, 
what hardware you are using, or whether there is a specific bottleneck 
(disk, memory, CPU) or particular queries that are slow.


Without that, no-one can suggest useful settings. You might find this 
document a good place to start: http://www.powerpostgresql.com/PerfList/


HTH
--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] [GENERAL] Fw: PostgreSQL Performance Tuning

2007-04-26 Thread Richard Huxton

Please try to post to one list at a time.

I've replied to this on the -performance list.

--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Fragmentation of WAL files

2007-04-26 Thread Bill Moran
In response to Jim Nasby <[EMAIL PROTECTED]>:

> I was recently running defrag on my windows/parallels VM and noticed  
> a bunch of WAL files that defrag couldn't take care of, presumably  
> because the database was running. What's disturbing to me is that  
> these files all had ~2000 fragments. Now, this was an EnterpriseDB  
> database which means the WAL files were 64MB instead of 16MB, but  
> even having 500 fragments for a 16MB WAL file seems like it would  
> definitely impact performance.

I don't know about that.  I've seen marketing material that claims that
modern NTFS doesn't suffer performance problems from fragmentation.  I've
never tested it myself, but my point is that you might want to do some
experiments -- you might find out that it doesn't make any difference.

If it does, you should be able to stop the DB, defragment the files, then
start the DB back up.  Since WAL files are recycled, they shouldn't
fragment again -- unless I'm missing something.

If that works, it may indicate that (on Windows) a good installation method
is to pre-create all the necessary WAL files at their full size before
launching the DB.

> Can anyone else confirm this? I don't know if this is a windows-only  
> issue, but I don't know of a way to check fragmentation in unix.

I can confirm that it's only a Windows problem.  No UNIX filesystem
that I'm aware of suffers from fragmentation.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] Fragmentation of WAL files

2007-04-26 Thread Heikki Linnakangas

Bill Moran wrote:

In response to Jim Nasby <[EMAIL PROTECTED]>:

I was recently running defrag on my windows/parallels VM and noticed  
a bunch of WAL files that defrag couldn't take care of, presumably  
because the database was running. What's disturbing to me is that  
these files all had ~2000 fragments. Now, this was an EnterpriseDB  
database which means the WAL files were 64MB instead of 16MB, but  
even having 500 fragments for a 16MB WAL file seems like it would  
definitely impact performance.


I don't know about that.  I've seen marketing material that claims that
modern NTFS doesn't suffer performance problems from fragmentation.  I've
never tested it myself, but my point is that you might want to do some
experiments -- you might find out that it doesn't make any difference.

If it does, you should be able to stop the DB, defragment the files, then
start the DB back up.  Since WAL files are recycled, they shouldn't
fragment again -- unless I'm missing something.

If that works, it may indicate that (on Windows) a good installation method
is to pre-create all the necessary WAL files at their full size before
launching the DB.


If that turns out to be a problem, I wonder if it would help to expand 
the WAL file to full size with ftruncate or something similar, instead 
of growing it page by page.


Can anyone else confirm this? I don't know if this is a windows-only  
issue, but I don't know of a way to check fragmentation in unix.


I can confirm that it's only a Windows problem.  No UNIX filesystem
that I'm aware of suffers from fragmentation.


What do you mean by suffering? All filesystems fragment files at some 
point. When and how differs from filesystem to filesystem. And some 
filesystems might be smarter than others in placing the fragments.


There's a tool for Linux in the e2fsprogs package called filefrag that 
shows the fragmentation of a file, but I've never used it myself.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)

2007-04-26 Thread Bill Moran
In response to Heikki Linnakangas <[EMAIL PROTECTED]>:

[snip]

> >> Can anyone else confirm this? I don't know if this is a windows-only  
> >> issue, but I don't know of a way to check fragmentation in unix.
> > 
> > I can confirm that it's only a Windows problem.  No UNIX filesystem
> > that I'm aware of suffers from fragmentation.
> 
> What do you mean by suffering? All filesystems fragment files at some 
> point. When and how differs from filesystem to filesystem. And some 
> filesystems might be smarter than others in placing the fragments.

To clarify my viewpoint:
To my knowledge, there is no Unix filesystem that _suffers_ from
fragmentation.  Specifically, all filesystems have some degree of
fragmentation that occurs, but every Unix filesystem that I am aware of
has built-in mechanisms to mitigate this and prevent it from becoming
a performance issue.

> There's a tool for Linux in the e2fsprogs package called filefrag that 
> shows the fragmentation of a file, but I've never used it myself.

Interesting.  However, the existence of a tool does not necessarily
indicate the _need_ for said tool.  It might just have been something
cool that somebody wrote.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] Fragmentation of WAL files

2007-04-26 Thread Tom Lane
> In response to Jim Nasby <[EMAIL PROTECTED]>:
>> I was recently running defrag on my windows/parallels VM and noticed  
>> a bunch of WAL files that defrag couldn't take care of, presumably  
>> because the database was running. What's disturbing to me is that  
>> these files all had ~2000 fragments.

It sounds like that filesystem is too stupid to coalesce successive
write() calls into one allocation fragment :-(.  I agree with the
comments that this might not be important, but you could experiment
to see --- try increasing the size of "zbuffer" in XLogFileInit to
maybe 16*XLOG_BLCKSZ, re-initdb, and see if performance improves.

The suggestion to use ftruncate is so full of holes that I won't
bother to point them all out, but certainly we could write more than
just XLOG_BLCKSZ at a time while preparing the file.

regards, tom lane



Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)

2007-04-26 Thread Craig A. James

Bill Moran wrote:

In response to Heikki Linnakangas <[EMAIL PROTECTED]>:
Can anyone else confirm this? I don't know if this is a windows-only  
issue, but I don't know of a way to check fragmentation in unix.

I can confirm that it's only a Windows problem.  No UNIX filesystem
that I'm aware of suffers from fragmentation.
What do you mean by suffering? All filesystems fragment files at some 
point. When and how differs from filesystem to filesystem. And some 
filesystems might be smarter than others in placing the fragments.


To clarify my viewpoint:
To my knowledge, there is no Unix filesystem that _suffers_ from
fragmentation.  Specifically, all filesystems have some degree of
fragmentation that occurs, but every Unix filesystem that I am aware of
has built-in mechanisms to mitigate this and prevent it from becoming
a performance issue.


More specifically, this problem was solved on UNIX file systems way back in the 
1970's and 1980's.  No UNIX file system (including Linux) since then has had 
significant fragmentation problems, unless the file system gets close to 100% 
full.  If you run below 90% full, fragmentation shouldn't ever be a significant 
performance problem.

The word "fragmentation" would have dropped from the common parlance if it 
weren't for MS Windoz.

Craig



Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)

2007-04-26 Thread Gregory Stark
"Craig A. James" <[EMAIL PROTECTED]> writes:

> More specifically, this problem was solved on UNIX file systems way back in the
> 1970's and 1980's. No UNIX file system (including Linux) since then has had
> significant fragmentation problems, unless the file system gets close to 100%
> full. If you run below 90% full, fragmentation shouldn't ever be a significant
> performance problem.

Note that the main technique used to avoid fragmentation -- paradoxically --
is to break the file up into reasonable sized chunks. This allows the
filesystem the flexibility to place the chunks efficiently.

In the case of a performance-critical file like the WAL that's always read
sequentially it may be to our advantage to defeat this technique and force it
to be allocated sequentially. I'm not sure whether any filesystems provide any
option to do so.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PERFORM] [GENERAL] Fw: PostgreSQL Performance Tuning

2007-04-26 Thread Gerhard Wiesinger

Hello!

I would do the following (in that order):
1.) Check for performant application logic and application design (e.g. 
the degree of granularity of the Java Hibernate mapping, whether there are 
object iterators with hundreds of objects, etc.)
2.) Check the Hibernate-generated queries and whether each query is 
suitable or not. Also run EXPLAIN on the query to see the query plan 
(a quick illustration follows below).


Sometimes a manually written query is much more efficient than the Hibernate-generated one.

3.) Optimize the database itself, e.g. the postgresql.conf settings.
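
For example (the table, column and value here are hypothetical; only the
EXPLAIN ANALYZE usage itself is the point):

EXPLAIN ANALYZE
SELECT o.* FROM orders o WHERE o.customer_id = 42;
-- a "Seq Scan" on a large table in this output is usually the first thing
-- to investigate: a missing index, stale statistics, or a type mismatch
-- between the literal and the indexed column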

Ciao,
Gerhard

--
http://www.wiesinger.com/


On Thu, 26 Apr 2007, Shohab Abdullah wrote:


Dear,
We are facing a performance tuning problem while using a PostgreSQL database
over the network on a Linux OS.
Our database consists of more than 500 tables with an average of 10K
records per table and an average of 20 users accessing the database
simultaneously over the network. Each table has indexes and we are
querying the database using Hibernate.
The biggest problem is with insertion, updating and fetching of records,
i.e. the database performance is very slow. It takes a long time to respond
in the above scenario.
Please advise me on tuning the database. I am attaching my
postgresql.conf file for reference on our current configuration.



Please reply ASAP.
Regards,
  Shohab Abdullah
  Software Engineer,
   Manufacturing SBU-POWAI
   Larsen and Toubro Infotech Ltd.| 4th floor, L&T Technology Centre,
Saki Vihar Road, Powai, Mumbai-400072
 Phone: +91-22-67767366 | +91-9870247322
   Visit us at : http://www.lntinfotech.com
"I cannot predict future, I cannot change past, I have just the present
moment, I must treat it as my last"





Re: Filesystem fragmentation (Re: [PERFORM] Fragmentation of WAL files)

2007-04-26 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> In the case of a performance-critical file like the WAL that's always read
> sequentially it may be to our advantage to defeat this technique and force it
> to be allocated sequentially. I'm not sure whether any filesystems provide any
> option to do so.

We more or less do that already by filling the entire file in one go
when it's created ...

regards, tom lane



Re: [PERFORM] not using indexes on large table

2007-04-26 Thread Scott Marlowe
On Sat, 2007-04-21 at 15:17, Jeroen Kleijer wrote:
> Hi all,
> 
> I'm a bit new to PostgreSQL and database design in general so forgive me
> for asking stupid questions. ;-)
> 
> I've setup a PostgreSQL database on a Linux machine (2 processor, 1GB
> mem) and while the database itself resides on a NetApp filer, via NFS,
> this doesn't seem to impact the performance to drastically.

What does a benchmark like bonnie++ say about your performance?  And I
hope your data's not too important to you, because I've had LOTS of
problems with NFS mounts in the past with pgsql.  Generally speaking,
NFS can be moderately fast, or moderately reliable (for databases) but
it generally isn't both at the same time.

Considering the cost of a quartet of 80 Gig SATA drives ($59x4) and a
decent RAID controller (LSI, Areca at ~$450 or so) you could be getting
VERY good performance out of your system with real reliability at the
same time on a RAID-10 volume.  Then use the NetApp for backup.  That's
what I'd do.

> I basically use it for indexed tables without any relation between 'em
> so far this has worked perfectly.
> 
> For statistics I've created the following table:
> volume varchar(30),
> qtree varchar(255),
> file varchar(512),
> ctime timestamp,
> mtime timestamp,
> atime timestamp
> annd created separate indexes on the volume and qtree columns.

You might want to look at setting this up as two or three tables, with a
view and update triggers to make it look like one table to the user, and with
qtree and file in their own tables.  That would make your main stats
table only one varchar(30) and 3 timestamps wide, which matters especially if
qtree and file tend to be large.  If one of those tends to be small and the
other large, then look at moving just the large one into its own table;
a rough sketch follows below.
The reasons for this will be obvious later on in this post.
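
Something along these lines (names are invented for illustration, and the
triggers or rules you would need to make the view updatable are left out):

CREATE TABLE qtrees (qtree_id serial PRIMARY KEY, qtree varchar(255) UNIQUE);
CREATE TABLE files  (file_id  serial PRIMARY KEY, file  varchar(512) UNIQUE);

CREATE TABLE stats (
    volume   varchar(30),
    qtree_id integer REFERENCES qtrees,
    file_id  integer REFERENCES files,
    ctime    timestamp,
    mtime    timestamp,
    atime    timestamp
);

-- present the three tables as one, like the original layout
CREATE VIEW stats_v AS
    SELECT s.volume, q.qtree, f.file, s.ctime, s.mtime, s.atime
      FROM stats s
      JOIN qtrees q USING (qtree_id)
      JOIN files  f USING (file_id);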

> The problem comes when I try to do a query without using a where clause
> because by then, it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)

Yes it does, and it should.

Why?  Visibility.  This has been discussed quite a bit on the lists. 
Because of the particular design for PostgreSQL's MVCC implementation,
indexes cannot contain visibility information on tables.  Therefore,
every time the db looks in an index, it then has to look in the table
anyway to find the right version of that tuple and to see if it's
actually valid for your snapshot.

> Can anyone tell me if this is normal behaviour (half an hour seems over
> the top to me) and if not, what I can do about it.

Yes this is normal behaviour.  It's just how PostgreSQL works.  There
are some workarounds out there that involve updating extra tables that
carry things like counts etc.  Each of these costs something in
overhead.

There are two distinct problems here.  One is that you're trying to use
PostgreSQL in a role where perhaps a different database might be a
better choice.  MSSQL Server or DB2 or even MySQL might be a better
choice depending on what you want to do with your data.

The other problem is that you're using an NFS server.  Either go whole
hog and buy a SAN with dual 2G NICs in it, or put local storage underneath
your machine with LOTS of hard drives in RAID-10.

Note that while other databases may be better at some of the queries
you're trying to run, it might be that PostgreSQL is still a good choice
because of other queries, and you can do rollups of the data that it's
slow at while using it for the things it is good at.

I've got a test db on my workstation that's pretty big at 42,463,248
rows and taking up 12 Gigs just for the table, 7.7 Gigs in indexes, and
a select count(*) on it takes 489 seconds.  I try not to do things like
that.  It covers the last 9 months of statistics.  

This query:

select a, b, count(*) from summary where atime > '2006-06-16' and
perspective = 'yada'
group by a, b
order by a, b

took 300 seconds, which is typical.

This is on a Workstation with one CPU, 2 gigs of ram, and a 150 Gig SATA
drive.  It's running X Windows, with Evolution, firefox, and a dozen
other user apps up and running.  Our "real" server, with 4 disks in a
RAID 5 on a mediocre RAID controller but with 2 CPUs and 6 gigs of ram,
stomps my little work station into the ground.  

I have the feeling my laptop with 512 Meg of ram and a 1.6 GHz CPU would
be faster than your current server.



Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread C Storm
On Apr 20, 12:07 pm, [EMAIL PROTECTED] (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usualvectornorm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERED on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds which is above my
> target at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus  [EMAIL PROTECTED]
>

Having worked in high-dimensional spaces a lot in my career, I think you'll
find that there are mathematical limits in terms of speed.  In practical
terms, a seq_scan will be unavoidable, since on first approximation you are
limited to doing an exhaustive search in 101-dimensional space unless you
make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality:
http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension reduction techniques such as Singular Value
Decomposition, Principal Components Analysis, etc.?




Re: [PERFORM] postgres: 100% CPU utilization

2007-04-26 Thread Sergey Tsukinovsky
Thanks for this reply, Ron.
This is almost what I was looking for.

While the upgrade to the latest version is out of the question (which,
unfortunately for me, became the subject of this discussion), I was still
looking for ways to improve the performance of the 7.0.2 version.

Extensive use of vacuum was almost obvious, though I was hoping to get
some more tips from postgres gurus (or dinosaurs, if you want).

Anyway, 8.2.4 was not performing so well without autovacuum: it
ramped up to 50% of CPU usage in 2 hours under the load.
With autovacuum ON I've got what I really need and thus I know what
to do next.

Just for the record - the hardware that was used for the test has the
following parameters:
AMD Opteron 2GHZ
2GB RAM
LSI Logic SCSI

Thanks everyone for your assistance!
Sergey




-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ron
Sent: Monday, April 23, 2007 11:07 AM
To: Mario Weilguni
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] postgres: 100% CPU utilization

At 04:53 AM 4/23/2007, Mario Weilguni wrote:
>On Thursday, 19 April 2007, Sergey Tsukinovsky wrote:
> > 2. What would be the recommended set of parameters to tune up in
order
> > to improve the performance over the time, instead of considering an
> > option to vacuum every 30 minutes or so?
> >
> > 3. Is it safe to run 'vacuum' as frequently as every 15-30 minutes?
>No problem.
>
> >
> > 4. Suggestions?
>Do yourself a favor and upgrade at least to 8.1.x and use autovacuum.
In fact, I'll go one step further and say that pg improves so much 
from release to release that everyone should make superhuman efforts 
to always be running the latest stable release.

Even the differences between 8.1.x and 8.2.x are worth it.

(and the fewer and more modern the releases "out in the wild", the 
easier community support is)
Cheers,
Ron Peacetree 






[PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread zardozrocks
I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree
(long_radians);



It's basically a table that associates some foreign_key (for an event,
for instance) with a particular location using longitude and
latitude.  I'm basically doing a simple proximity search.  I have
populated the database with *10 million* records.  I then test
performance by picking 50 zip codes at random and finding the records
within 50 miles with a query like this:

SELECT id
FROM test_zip_assoc
WHERE
lat_radians > 0.69014816041
AND lat_radians < 0.71538026567
AND long_radians > -1.35446228028
AND long_radians < -1.32923017502


On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
ram) this query averages 1.5 seconds each time it runs after a brief
warmup period.  In PostGreSQL it averages about 15 seconds.

Both of those times are too slow.  I need the query to run in under a
second with as many as a billion records.  I don't know if this is
possible but I'm really hoping someone can help me restructure my
indexes (multicolumn?, multiple indexes with a 'where' clause?) so
that I can get this running as fast as possible.

If I need to consider some non-database data structure in RAM I will
do that too.  Any help or tips would be greatly appreciated.  I'm
willing to go to great lengths to test this if someone can make a
good suggestion that sounds like it has a reasonable chance of
improving the speed of this search.  There's an extensive thread on my
efforts already here:

http://phpbuilder.com/board/showthread.php?t=10331619&page=10




Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Merlin Moncure

On 24 Apr 2007 14:26:46 -0700, zardozrocks <[EMAIL PROTECTED]> wrote:

I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree
(long_radians);



It's basically a table that associates some foreign_key (for an event,
for instance) with a particular location using longitude and
latitude.  I'm basically doing a simple proximity search.  I have
populated the database with *10 million* records.  I then test
performance by picking 50 zip codes at random and finding the records
within 50 miles with a query like this:

SELECT id
FROM test_zip_assoc
WHERE
lat_radians > 0.69014816041
AND lat_radians < 0.71538026567
AND long_radians > -1.35446228028
AND long_radians < -1.32923017502


On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
ram) this query averages 1.5 seconds each time it runs after a brief
warmup period.  In PostGreSQL it averages about 15 seconds.

Both of those times are too slow.  I need the query to run in under a
second with as many as a billion records.  I don't know if this is
possible but I'm really hoping someone can help me restructure my
indexes (multicolumn?, multiple indexes with a 'where' clause?) so
that I can get this running as fast as possible.

If I need to consider some non-database data structure in RAM I will
do that too.  Any help or tips would be greatly appreciated.  I'm
willing to go to great lengths to test this if someone can make a
good suggestion that sounds like it has a reasonable chance of
improving the speed of this search.  There's an extensive thread on my
efforts already here:


You can always go the earthdistance route.  The index takes longer to
build (roughly 5x longer) than a btree, but it will optimize that exact
operation.
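
Something along these lines (this assumes the cube and earthdistance contrib
modules are installed; ll_to_earth() wants degrees, so the radian columns are
converted, and the 50-mile radius is expressed in metres -- the centre point
here is made up):

CREATE INDEX test_zip_assoc_earth_idx ON test_zip_assoc
 USING gist (ll_to_earth(degrees(lat_radians::float8),
                         degrees(long_radians::float8)));

SELECT id
  FROM test_zip_assoc
 WHERE earth_box(ll_to_earth(40.0, -77.0), 80467)
       @> ll_to_earth(degrees(lat_radians::float8),
                      degrees(long_radians::float8));

-- note that earth_box() is a bounding-box test, so it can return a few
-- points slightly beyond the radius; add an earth_distance() filter if
-- you need the cutoff to be exact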

merlin



Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Bill Moran
In response to zardozrocks <[EMAIL PROTECTED]>:

> I have this table:
> 
> CREATE TABLE test_zip_assoc (
> id serial NOT NULL,
> f_id integer DEFAULT 0 NOT NULL,
> lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
> 
> 
> 
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
> 
> SELECT id
>   FROM test_zip_assoc
>   WHERE
>   lat_radians > 0.69014816041
>   AND lat_radians < 0.71538026567
>   AND long_radians > -1.35446228028
>   AND long_radians < -1.32923017502
> 
> 
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
> 
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
> 
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to great lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:
> 
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10

Why didn't you investigate/respond to the last posts there?  The advice
to bump shared_buffers is good advice.  work_mem might also need to be bumped.

Figure out which postgresql.conf your system is using and get it dialed
in for your hardware.  You can make all the indexes you want, but if
you've told Postgres that it only has 8M of RAM to work with, performance
is going to suck.  I don't see hardware specs on that thread (but I
didn't read the whole thing).  If the system you're using is a dedicated
DB system, set shared_buffers to 1/3 - 1/2 of the physical RAM on the
machine for starters.
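
A quick sanity check that the settings you edited are the ones the running
server actually loaded:

SHOW shared_buffers;
SHOW work_mem;
-- SHOW ALL lists every setting the server is currently using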

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] [GENERAL] PostgreSQL Performance Tuning

2007-04-26 Thread Steve Crawford
Shohab Abdullah wrote:
> 
> Dear,
> We are facing a performance tuning problem while using a PostgreSQL database
> over the network on a Linux OS.
> Our database consists of more than 500 tables with an average of 10K
> records per table and an average of 20 users accessing the database
> simultaneously over the network. Each table has indexes and we are
> querying the database using Hibernate.
> The biggest problem is with insertion, updating and fetching of
> records, i.e. the database performance is very slow. It takes a long time
> to respond in the above scenario.
> Please advise me on tuning the database. I am attaching my
> *postgresql.conf* file for reference on our current configuration

Have you changed _anything_ from the defaults? The defaults are set so
PG will run on as many installations as practical. They are not set for
performance - that is specific to your equipment, your data, and how you
need to handle the data. Assuming the record sizes aren't huge, that's
not a very large data set nor number of users.

Look at these for starters:
http://www.varlena.com/GeneralBits/Tidbits/perf.html
http://www.varlena.com/GeneralBits/Tidbits/annotated_conf_e.html

You might try setting the logging parameters to log queries longer than
"x" (where x is in milliseconds - you will have to decide the
appropriate value for "too long") and start looking into those first.

Make sure that you are running "analyze" if it is not being run by
autovacuum.
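
For example (the 250 ms threshold is just a guess at "too long"; SET needs
superuser rights, or put the same line in postgresql.conf and reload):

SET log_min_duration_statement = 250;  -- log any statement slower than 250 ms
ANALYZE;                               -- refresh planner statistics for all tables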

Use "EXPLAIN <query>" to see how the query is being planned - as a
first pass, assume that on any reasonably sized table the words
"sequential scan" mean "fix this". Note that you may have to cast
variables in a query to match the variable in an index in order for the
planner to figure out that it can use the index.

Read the guidelines then take an educated stab at some settings and see
how they work - other than turning off fsync, there's not much in
postgresql.conf that will put your data at risk.

Cheers,
Steve



Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Benjamin Minshall

zardozrocks wrote:

lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL


Native data types such as integer or real are much faster than numeric. 
 If you need 6 digits, it's better to multiply your coordinates by 10^6 
and store as INTEGER.
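
A sketch of that, keeping your column names and assuming six decimal digits
of the radian values is enough precision:

ALTER TABLE test_zip_assoc ADD COLUMN lat_int integer;
ALTER TABLE test_zip_assoc ADD COLUMN long_int integer;

UPDATE test_zip_assoc
   SET lat_int  = round(lat_radians  * 1000000)::integer,
       long_int = round(long_radians * 1000000)::integer;

CREATE INDEX test_zip_assoc_lat_int_idx  ON test_zip_assoc (lat_int);
CREATE INDEX test_zip_assoc_long_int_idx ON test_zip_assoc (long_int);

-- the range query then compares plain integers, e.g.
--   WHERE lat_int  BETWEEN 690148   AND 715380
--     AND long_int BETWEEN -1354462 AND -1329230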



On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
ram) this query averages 1.5 seconds each time it runs after a brief
warmup period.  In PostGreSQL it averages about 15 seconds.


What hard drive(s) and controller(s) do you have?  Please post EXPLAIN 
ANALYZE output of the problem query and your postgresql.conf also.


--
Benjamin Minshall <[EMAIL PROTECTED]>
Senior Developer -- Intellicon, Inc.




Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Luke Lonergan
NUMERIC operations are very slow in pgsql.  Equality comparisons are somewhat 
faster, but other operations are very slow compared to other vendors' NUMERIC.

We've sped it up a lot here internally, but you may want to consider using 
FLOAT for what you are doing.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Bill Moran [mailto:[EMAIL PROTECTED]
Sent:   Thursday, April 26, 2007 05:13 PM Eastern Standard Time
To: zardozrocks
Cc: pgsql-performance@postgresql.org
Subject:Re: [PERFORM] Simple query, 10 million records...MySQL ten 
times faster

In response to zardozrocks <[EMAIL PROTECTED]>:

> I have this table:
> 
> CREATE TABLE test_zip_assoc (
> id serial NOT NULL,
> f_id integer DEFAULT 0 NOT NULL,
> lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
> 
> 
> 
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
> 
> SELECT id
>   FROM test_zip_assoc
>   WHERE
>   lat_radians > 0.69014816041
>   AND lat_radians < 0.71538026567
>   AND long_radians > -1.35446228028
>   AND long_radians < -1.32923017502
> 
> 
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
> 
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
> 
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to great lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:
> 
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10

Why didn't you investigate/respond to the last posts there?  The advice
to bump shared_buffers is good advice.  work_mem might also need bumped.

Figure out which postgresql.conf your system is using and get it dialed
in for your hardware.  You can make all the indexes you want, but if
you've told Postgres that it only has 8M of RAM to work with, performance
is going to suck.  I don't see hardware specs on that thread (but I
didn't read the whole thing)  If the system you're using is a dedicated
DB system, set shared_buffers to 1/3 - 1/2 of the physical RAM on the
machine for starters.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023




Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Alexander Staubo

On 4/20/07, Andrew Lazarus <[EMAIL PROTECTED]> wrote:

I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.


Let me chime in with the observation that this is a multidimensional
nearest neighbour (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help
you dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on
Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search
space by probalistic means. They give an example using image
histograms, but the algorithm applies to any multidimensional data.
Briefly put, it points out that proximity comparison can be computed
vertically, a few dimensions at a time, and entire subsets can be
thrown away when it's apparent that they are below a statistically
derived lower bound. The only gotcha is that the algorithm derives
much of its performance from the assumption that your data is
vertically decomposed, one table per dimension, otherwise the search
effectively incurs a sequential scan of the entire dataset, and then
you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which are available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)

Alexander.



Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Scott Marlowe
On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
> I have this table:
> 
> CREATE TABLE test_zip_assoc (
> id serial NOT NULL,
> f_id integer DEFAULT 0 NOT NULL,
> lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );

Like someone else mentioned, numeric types are SLOW.  See if you can use
integers, or at least floats.

I also wonder if you might be better served with geometric types and
GiST indexes on them than using your lat / long grid.  With a geometric
type, you could define the lat / long as a point and use geometric
operations with it.

See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html
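
Here's a rough sketch of that idea.  It assumes a GiST opclass for the box
type is available (built in on recent releases, from contrib/rtree_gist on
older ones); each row gets a point, and an expression index on a degenerate
box lets the && (overlaps) operator use the index:

ALTER TABLE test_zip_assoc ADD COLUMN loc point;
UPDATE test_zip_assoc SET loc = point(long_radians, lat_radians);
CREATE INDEX test_zip_assoc_loc_idx ON test_zip_assoc USING gist (box(loc, loc));
ANALYZE test_zip_assoc;

SELECT id
  FROM test_zip_assoc
 WHERE box(loc, loc) && box(point(-1.35446228028, 0.69014816041),
                            point(-1.32923017502, 0.71538026567));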

> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:

I assume that there aren't 10 million zip codes, right?

Are you storing the lat / long of the individual venues?  Or the zip
codes?  If you're storing the lat / long of the zips, then I can't
imagine there are 10 million zip codes.  If you could use the lat / long
numbers to find the zip codes that are in your range, then join that to
a venue table that fks off of the zip code table, I would think it would
be much faster, as you'd have a smaller data set to trundle through.

> SELECT id
>   FROM test_zip_assoc
>   WHERE
>   lat_radians > 0.69014816041
>   AND lat_radians < 0.71538026567
>   AND long_radians > -1.35446228028
>   AND long_radians < -1.32923017502
> 
> 
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.

I wonder how well it would run if you had 10, 20, 30, 40 etc... users
running it at the same time.  My guess is that you'll be very lucky to
get anything close to linear scaling in any database.  That's because
this is CPU / Memory bandwidth intensive, so it's gonna kill your DB. 
OTOH, if it were I/O bound you could throw more hardware at it (a
battery-backed cache RAID controller, etc.)

> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.

You're trying to do a whole lot of processing in a little time.  You're
either gonna have to accept a less exact answer (i.e. base it on zip
codes) or come up with some way of mining the data for the answers ahead
of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and
work_mem yet?



Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Jeff Hoffmann

zardozrocks wrote:

I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree
(long_radians);


Maybe I'm missing something, but wouldn't it be easier to just use 
PostGIS?  Or failing that, using the vanilla built-in point type and an 
r-tree index?  That's what r-tree indexes are made for.


--
Jeff Hoffmann
Head Plate Spinner
PropertyKey.com



Re: [PERFORM] postgres: 100% CPU utilization

2007-04-26 Thread Scott Marlowe
On Tue, 2007-04-24 at 10:30, Sergey Tsukinovsky wrote:
> Thanks for this reply, Ron.
> This is almost what I was looking for.
> 
> While the upgrade to the latest version is out of the question (which
> unfortunately for me became the subject of this discussion) still, I was
> looking for the ways to improve the performance of the 7.0.2 version. 
> 
> Extensive use of vacuum was almost obvious, though I was hoping to get
> some more tips from postrges gurus (or dinosaurs, if you want).
> 
> Anyways, the 8.2.4 was not performing so well without auto-vacuum. It
> ramped up to 50% of CPU usage in 2 hours under the load.
> With the auto-vacuum ON I've got what I really need and thus I know what
> to do next.

Could you give us a better picture of how you were testing 8.2.4?  My
guess is that you were doing something that seemed right to you, but was
working against yourself, like constant vacuum fulls and getting index
bloat, or something else.  

Why were you trying to not use autovacuum, btw?  I've found it to be
quite capable, with only a few situations (high speed queueing) where I
needed to manually schedule vacuums.  And I've never seen a situation
since about 7.4 where regular full vacuums were required.

> Just for the record - the hardware that was used for the test has the
> following parameters:
> AMD Opteron 2GHZ
> 2GB RAM
> LSI Logic SCSI

Nice hardware.  I'd really like to hear the logic behind your statement
that upgrading to 8.1 or 8.2 is out of the question.



Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Alexander Staubo

On 24 Apr 2007 14:26:46 -0700, zardozrocks <[EMAIL PROTECTED]> wrote:

I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree
(long_radians);


This is a spatial search -- B-tree indexes are much less efficient at
this than certain other data structures. The R-tree and its many
variants are based on subdividing the space in regions, allowing you
to do efficient checks on containment, intersection, etc., based on
points or bounding boxes.

PostgreSQL implements R-trees natively as well as through a mechanism
called GiST, a framework for implementing pluggable tree-like indexes.
It also provides some geometric data types. However, as far as I know,
PostgreSQL's R-tree/GiST indexes do *not* provide the operators to do
bounding box searches. For this you need PostGIS.

PostGIS implements the whole GIS stack, and it's so good at this that
it's practically the de facto tool among GIS analysts. Installing
PostGIS into a database is simple, and once you have done this, you
can augment your table with a geometry (*):

 alter table test_zip_assoc add column lonlat geometry;
 update test_zip_assoc set lonlat = makepoint(
   long_radians / (3.14159265358979 / 180),
   lat_radians / (3.14159265358979 / 180));

The division is to convert your radians into degrees; PostGIS works
with degrees, at least out of the box.

Now you can query on a bounding box (although, are you sure you got
your lons and lats in order? That's Antarctica, isn't it?):

 select * from test_zip_assoc
 where lonlat && makebox2d(
   makepoint(-77.6049721697096, 39.5425768302107),
   makepoint(-76.1592790300818,  40.9882699698386))

This is bound to be blazingly fast. Next you can order by geographic
distance if you like:

 order by distance_sphere(lonlat,
   makepoint(-77.6049721697096, 39.5425768302107))

Nobody has mentioned PostGIS so far, so I hope I'm not missing some
crucial detail, like "no spatial indexes allowed!".

(*) I cheated here. The PostGIS manual recommends that you use a
function to create the geometry column, because it will set up some
auxiliary data structures for you that are needed for certain
operations. The recommended syntax is:

 select AddGeometryColumn('', 'test_zip_assoc', 'geom', -1, 'POINT', 2);

Alexander.

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Alexander Staubo

On 4/27/07, Alexander Staubo <[EMAIL PROTECTED]> wrote:
[snip]

PostGIS implements the whole GIS stack, and it's so good at this that
it's practically the de facto tool among GIS analysts. Installing
PostGIS into a database is simple, and once you have done this, you
can augment your table with a geometry (*):

  alter table test_zip_assoc add column lonlat geometry;


I forgot to include the crucial step, of course:

 create index test_zip_assoc_lonlat_index on test_zip_assoc
   using gist (lonlat gist_geometry_ops);
 analyze test_zip_assoc;

This creates a GiST index on the geometry and (significantly) updates
the table statistics.
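
To double-check that the planner actually uses it, an EXPLAIN along these
lines (hypothetical, reusing the bounding box from the previous message)
should show an index scan on the GiST index rather than a sequential scan:

 explain select * from test_zip_assoc
 where lonlat && makebox2d(
   makepoint(-77.6049721697096, 39.5425768302107),
   makepoint(-76.1592790300818, 40.9882699698386));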

Alexander.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] postgres: 100% CPU utilization

2007-04-26 Thread Mark Kirkwood

Sergey Tsukinovsky wrote:


Just for the record - the hardware that was used for the test has the
following parameters:
AMD Opteron 2GHZ
2GB RAM
LSI Logic SCSI



And you ran FreeBSD 4.4 on it, right? This may be a source of high CPU 
utilization in itself if the box is SMP or dual core, as multi-CPU 
support was pretty primitive in that release (4.11 would be better if 
you are required to stick to the 4.x branch; if not, 6.2 is recommended)!


Cheers

Mark



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread joe


Is there a reason you are not using PostGIS? Its R-tree indexes are
designed for exactly this type of query and should be able to do it very
quickly.

Hope that helps,

Joe

> I have this table:
>
> CREATE TABLE test_zip_assoc (
> id serial NOT NULL,
> f_id integer DEFAULT 0 NOT NULL,
> lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
>
>
>
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
>
> SELECT id
>   FROM test_zip_assoc
>   WHERE
>   lat_radians > 0.69014816041
>   AND lat_radians < 0.71538026567
>   AND long_radians > -1.35446228028
>   AND long_radians < -1.32923017502
>
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
>
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
>
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to greath lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:
>
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10
>
>
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match
>



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-26 Thread Carlos Moreno

Steve Crawford wrote:

Have you changed _anything_ from the defaults? The defaults are set so
PG will run on as many installations as practical. They are not set for
performance - that is specific to your equipment, your data, and how you
need to handle the data. 

Is this really the sensible thing to do?  I know we should not encourage
the world we're living in even more in its ways of "have the computer
do everything for us so that we don't need to have even a clue about what
we're doing" ...  But wouldn't it make sense for the configure script to
determine the amount of physical memory, and perhaps even do a HD speed
estimate, to set up defaults that are closer to a performance-optimized
configuration?

Then, perhaps command switches so that you could specify the type of
access you estimate for your system.  Perhaps something like:

./configure --db-size=100GB --write-percentage=20   etc.

(switch write-percentage above indicates that we estimate that 20% of
the DB activity would be writing to the disk --- there may be other
switches to indicate the percentage of queries that are transactions,
the percentage of queries that are complex;  percentage that require
index usage, etc. etc. etc.)...  And then, based on that, a better set of
defaults could be set by the configuration script.

Does this make sense?  Or perhaps I'm watching too much science
fiction?

Carlos
--


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Feature Request --- was: PostgreSQL Performance Tuning

2007-04-26 Thread Tom Lane
Carlos Moreno <[EMAIL PROTECTED]> writes:
> ...  But, wouldn't it make sense that the configure script
> determines the amount of physical memory and perhaps even do a HD
> speed estimate to set up defaults that are closer to a 
> performance-optimized
> configuration?

No.  Most copies of Postgres these days are executed on machines very
far away from where the code was built.  It's a little bit safer to
try to tune things at initdb time ... as indeed we already do.  But
the fundamental problem remains that we don't know that much about
how the installation will be used.  For example, the planner
configuration parameters turn out to have not that much to do with the
absolute speed of your drive, and a whole lot to do with the ratio
of the size of your database to the amount of RAM you've got; and the
ultimate size of the DB is one thing initdb certainly can't guess.
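
(To make that concrete: the knobs in question are things like the following
postgresql.conf entries, with purely illustrative values -- not
recommendations -- for a box whose database is several times larger than RAM:

  shared_buffers = 400MB          # 8.2 accepts memory units here
  effective_cache_size = 1GB      # rough guess at OS cache plus shared_buffers
  random_page_cost = 4            # keep high while the working set exceeds RAM

and sensible values depend on that database-size-to-RAM ratio, which nothing
at build time can know.)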

Also, there is an extremely good reason why Postgres will never be set
up to try to take over the whole machine by default: most of the
developers run multiple postmasters on their machines.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] Usage up to 50% CPU

2007-04-26 Thread Andres Retzlaff

Hi,

I have pg 8.1.4 running on
Windows XP Pro
with a Pentium D

and I notice that I cannot use more than 50% of the CPU (the Pentium D has 2 
CPUs). How can I change the settings to use 100% of it?


Regards,
Andrew Retzlaff




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Oleg Bartunov

On Fri, 27 Apr 2007, Alexander Staubo wrote:


On 4/20/07, Andrew Lazarus <[EMAIL PROTECTED]> wrote:

I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.


Let me chime in with the observation that this is a multidimensional
nearest-neighbour problem (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help
you dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on
Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search
space by probabilistic means. They give an example using image
histograms, but the algorithm applies to any multidimensional data.
Briefly put, it points out that proximity comparison can be computed
vertically, a few dimensions at a time, and entire subsets can be
thrown away when it's apparent that they are below a statistically
derived lower bound. The only gotcha is that the algorithm derives
much of its performance from the assumption that your data is
vertically decomposed, one table per dimension, otherwise the search
effectively incurs a sequential scan of the entire dataset, and then
you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which are available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)


You're right, but currently only in theory, due to an interface restriction.
We have a plan to improve it sometime. There was the SP-GiST project, which 
could be used for a K-D-B tree; see http://www.cs.purdue.edu/spgist/

I don't know if it works with version 8.2. Also, it doesn't support
concurrency or recovery.



Alexander.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

 http://archives.postgresql.org



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Oleg Bartunov

Folks,

we in astronomy constantly work with billions of objects with spherical
attributes and have several sky-indexing schemes. See my page
for links: http://www.sai.msu.su/~megera/wiki/SkyPixelization

We have the q3c package for PostgreSQL, available from q3c.sf.net, which 
we use in production with terabyte-sized databases.
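
A cone search with it looks roughly like this -- a sketch from memory of the
q3c documentation, so the function names and the degree-based lon/lat columns
are assumptions to be checked against q3c.sf.net:

 create index test_zip_assoc_q3c_idx
   on test_zip_assoc (q3c_ang2ipix(lon_deg, lat_deg));
 analyze test_zip_assoc;

 select id from test_zip_assoc
  where q3c_radial_query(lon_deg, lat_deg, -77.0, 39.5, 0.7);
        -- center lon/lat and search radius, all in degrees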


Oleg
On Thu, 26 Apr 2007, Scott Marlowe wrote:


On Tue, 2007-04-24 at 16:26, zardozrocks wrote:

I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);


Like someone else mentioned, numeric types are SLOW.  See if you can use
integers, or at least floats.

I also wonder if you might be better served with geometric types and
GiST indexes on them than using your lat / long grid.  With a geometric
type, you could define the lat / long as a point and use geometric
operations with it.

See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html


It's basically a table that associates some foreign_key (for an event,
for instance) with a particular location using longitude and
latitude.  I'm basically doing a simple proximity search.  I have
populated the database with *10 million* records.  I then test
performance by picking 50 zip codes at random and finding the records
within 50 miles with a query like this:


I assume that there aren't 10 million zip codes, right?

Are you storing the lat / long of the individual venues?  Or the zip
codes?  If you're storing the lat / long of the zips, then I can't
imagine there are 10 million zip codes.  If you could use the lat / long
numbers to find the zip codes that are in your range, then join that to
a venue table that fks off of the zip code table, I would think it would
be much faster, as you'd have a smaller data set to trundle through.


SELECT id
FROM test_zip_assoc
WHERE
lat_radians > 0.69014816041
AND lat_radians < 0.71538026567
AND long_radians > -1.35446228028
AND long_radians < -1.32923017502


On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
ram) this query averages 1.5 seconds each time it runs after a brief
warmup period.  In PostGreSQL it averages about 15 seconds.


I wonder how well it would run if you had 10, 20, 30, 40 etc... users
running it at the same time.  My guess is that you'll be very lucky to
get anything close to linear scaling in any database.  That's because
this is CPU / Memory bandwidth intensive, so it's gonna kill your DB.
OTOH, if it was I/O bound you could throw more hardware at it (bb cache
RAID controller, etc)


Both of those times are too slow.  I need the query to run in under a
second with as many as a billion records.  I don't know if this is
possible but I'm really hoping someone can help me restructure my
indexes (multicolumn?, multiple indexes with a 'where' clause?) so
that I can get this running as fast as possible.


You're trying to do a whole lot of processing in a little time.  You're
either gonna have to accept a less exact answer (i.e. base it on zip
codes) or come up with some way of mining the data for the answers ahead
of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and
work_mem yet?

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Fragmentation of WAL files

2007-04-26 Thread Greg Smith

On Thu, 26 Apr 2007, Bill Moran wrote:

I've seen marketing material that claims that modern NTFS doesn't suffer 
performance problems from fragmentation.


You're only reading half of the marketing material then.  For a balanced 
picture, read the stuff generated by the companies that sell defragmenting 
tools.  A good one to start with is 
http://files.diskeeper.com/pdf/HowFileFragmentationOccursonWindowsXP.pdf


Going back to Jim's original question, they suggest a Microsoft paper 
that talks about how the defrag report can be misleading with respect to 
open files.  See http://support.microsoft.com/kb/228198


Also, some of the most interesting details they gloss over are specific to 
which version of Windows you're using; the reference guide to the subject 
of how NTFS decides how much space to pre-allocate at a time is available 
at http://support.microsoft.com/kb/841551 (ZIP file wrapped into EXE, 
yuck!)


If you compare them, you can see that the approach they're using in NTFS 
has evolved to become more like that used by a good UNIX filesystem over 
time.  I think your typical UNIX still has a healthy lead in this area, 
but the difference isn't as big as it used to be.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Arjen van der Meijden

On 21-4-2007 1:42 Mark Kirkwood wrote:

I don't think that will work for the vector norm i.e:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))


I don't know if this is useful here, but I was able to rewrite that 
algorithm for a set of very sparse vectors (i.e. they had very few 
overlapping components) to the equivalent squared distance:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
            + for each j where x[j] and y[j] are both non-zero:
                (x[j] - y[j])^2 - x[j]^2 - y[j]^2


The first two sums can be calculated only once. So if you have very 
little overlap, this is much more efficient (if there is no overlap at 
all, you end up with sum x[j]^2 + sum y[j]^2 anyway). Besides, this 
rewritten calculation allows you to store the X and Y vectors using a 
trivial table layout vector(x, i, value) which is only filled with 
non-zeroes and which you can trivially self-join to find the closest 
matches. With this rewrite, you don't care about the j's where either 
the x or the y value is missing anyway.
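
A sketch of that self-join in SQL (table, column and index names here are
made up, and the squared norms are assumed to be precomputed once per
vector):

 -- one row per non-zero element
 create table sparse_vec (vec_id integer, dim integer, val float8);
 create index sparse_vec_dim_idx on sparse_vec (dim, vec_id);

 -- per-vector squared norms, computed once
 create table vec_norm as
   select vec_id, sum(val * val) as sqnorm
     from sparse_vec group by vec_id;

 -- 25 nearest neighbours of vector 42 by squared distance
 select n.vec_id,
        q.sqnorm + n.sqnorm + coalesce(o.corr, 0) as sqdist
   from vec_norm n
  cross join vec_norm q
   left join (
        -- correction term over the overlapping dimensions only
        select y.vec_id,
               sum((x.val - y.val) * (x.val - y.val)
                   - x.val * x.val - y.val * y.val) as corr
          from sparse_vec x
          join sparse_vec y on y.dim = x.dim
         where x.vec_id = 42
         group by y.vec_id
        ) o on o.vec_id = n.vec_id
  where q.vec_id = 42 and n.vec_id <> 42
  order by sqdist
  limit 25;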


I can compare over 1000 y's of on average 100 elements each to two x's of 
over 1000 elements on just a single 1.8GHz AMD processor. (I use it for 
a bi-kmeans algorithm, so there are only two buckets to compare to.)


So it might be possible to rewrite your algorithm to be less 
calculation-intensive. Obviously, with a dense matrix this isn't going 
to work, but there may be other ways to circumvent parts of the 
algorithm or to cache large parts of it.
It might also help to extract only the 6 relevant columns into a 
separate temporary table, which will have much smaller records and thus 
can fit more records per page.


Best regards,

Arjen

---(end of broadcast)---
TIP 6: explain analyze is your friend