Re: [PERFORM] Performace Optimization for Dummies

2006-10-03 Thread Markus Schaber
Hi, Carlo,

Carlo Stonebanks wrote:
 Did you think about putting the whole data into PostgreSQL using COPY in
 a nearly unprocessed manner, index it properly, and then use SQL and
 stored functions to transform the data inside the database to the
 desired result?
 
 This is actually what we are doing. The slowness is on the row-by-row 
transformation. Every row requires that all the inserts and updates of the 
previous row be committed - that's why we have problems figuring out how to 
do this using SQL set logic.

Maybe group by, order by, distinct on and hand-written functions
and aggregates (like first() or best()) can help.

You could combine all relevant columns into a user-defined compound
type, then group by entity, and have a self-defined aggregate generate
the accumulated tuple for each entity.
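
A rough sketch of that idea (the type, table and column names below are all
invented, and the score would come from whatever rating logic you already use):

create type match_tuple as (name text, company text, address text, score integer);

-- keep whichever of two candidate tuples has the higher score
create function best_acc(match_tuple, match_tuple) returns match_tuple as
'select case when $1 is null or ($2).score > ($1).score then $2 else $1 end'
language sql;

create aggregate best (
    basetype = match_tuple,
    sfunc    = best_acc,
    stype    = match_tuple
);

-- one accumulated tuple per entity
select entity_key,
       best((name, company, address, score)::match_tuple)
  from staging
 group by entity_key;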

Markus
-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org



Re: [PERFORM] Performace Optimization for Dummies

2006-10-03 Thread Markus Schaber
Hi, Carlo,

Carlo Stonebanks wrote:

 Trying to achieve a high level of data quality in one large project is
 not often possible. Focus on the most critical areas of checking and get
 that working first with acceptable performance, then layer on additional
 checks while tuning. The complexity of the load programs you have also
 means they are susceptible to introducing data quality problems rather
 than removing them, so an incremental approach will also aid debugging
 of the load suite.
 
 I couldn't agree more.

I still think that using a PL in the backend might perform better
than an external client, if only because the SPI interface is more
efficient than the network serialization needed for external applications.

HTH,
Markus

-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org



Re: [PERFORM] Performace Optimization for Dummies

2006-10-03 Thread Carlo Stonebanks
 I still think that using a PL in the backend might perform better
 than an external client, if only because the SPI interface is more
 efficient than the network serialization needed for external applications.

I would actually love for this to work better, as this is technology that I 
would like to develop in general - I see db servers with strong server-side 
programming languages as being able to operate as application servers, with 
the enterprise's business logic centralised on the server.

The import routine that I wrote will actually work on the server as well - 
it will detect the presence of the spi_ calls, and replace the pg_* calls 
with spi_* calls. So, you see this WAS my intention.

However, the last time I tried to run something that complex from the db 
server, it ran quite slowly compared to running it from a client. This may have had 
something to do with the client that I used to call the stored procedure - I 
thought that perhaps the client created an implicit transaction around my 
SQL statement to allow a rollback, and all of the updates and inserts got 
backed up in a massive transaction queue that took forever to commit.

Carlo 





Re: [PERFORM] Performace Optimization for Dummies

2006-10-03 Thread Carlo Stonebanks
 Maybe group by, order by, distinct on and hand-written functions
 and aggregates (like first() or best()) can help.

We use these - we have lexical analysis functions which assign each row in a 
set a rating for the likelihood that the data is a match, and then we 
sort our results.

I thought this would be the cause of the slowdowns - and it is, but a very 
small part of it. I have identified the problem code, and the problems are 
within some very simple joins. I have posted the code under a related topic 
header. I obviously have a few things to learn about optimising SQL joins.

Carlo


 You could combine all relevant columns into a user-defined compound
 type, then group by entity, and have a self-defined aggregate generate
 the accumulated tuple for each entity.

 Markus
 -- 
 Markus Schaber | Logical Tracking & Tracing International AG
 Dipl. Inf. | Software Development GIS

 Fight against software patents in Europe! www.ffii.org
 www.nosoftwarepatents.org






Re: [PERFORM] Performace Optimization for Dummies

2006-10-02 Thread Carlo Stonebanks
I have loaded three of the four cores by running three different versions of 
the import program, each importing a different segment of the table. The 
server jumps to 75% usage, with three postgresql processes 
eating up 25% each, the actual client itself taking up just a few ticks.

Heikki Linnakangas [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Carlo Stonebanks wrote:
 We urgently need a major performance improvement. We are running the
 PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual 
 core
 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) 
 disc
 subsystem. Sorry about the long intro, but here are my questions:

 Others have already drilled down to the way you do the inserts and 
 statistics etc., so I'll just point out:

 Are you fully utilizing all the 4 cores you have? Could you parallelize 
 the loading process, if you're currently running just one client? Are you 
 I/O bound or CPU bound?

 -- 
 Heikki Linnakangas
 EnterpriseDB http://www.enterprisedb.com






Re: [PERFORM] Performace Optimization for Dummies

2006-10-02 Thread Carlo Stonebanks
 Did you think about putting the whole data into PostgreSQL using COPY in
 a nearly unprocessed manner, index it properly, and then use SQL and
 stored functions to transform the data inside the database to the
 desired result?

This is actually what we are doing. The slowness is on the row-by-row 
transformation. Every row requires that all the inserts and updates of the 
previous row be committed - that's why we have problems figuring out how to 
do this using SQL set logic.

Carlo 





Re: [PERFORM] Performace Optimization for Dummies

2006-10-02 Thread Carlo Stonebanks
 My experience with that type of load process is that doing this
 row-by-row is a very expensive approach and your results bear that out.

I expected this, and had warned the client before the project started that 
this is exactly where SQL underperforms.

 It is often better to write each step as an SQL statement that operates
 on a set of rows at one time.

The problem with this approach is that every row of data is dependent on the 
previous row's data being validated and imported. e.g.

Import Row 1:
John Q Smith
Foobar Corp
123 Main St,
Bigtown, MD 12345-6789

Import Row 2:
John Quincy Smith
FuzzyLoginc Inc
123 Main St, Suite 301
Bigtown, MD 12345-6789

Import Row 3:
Bobby Jones
Foobar Corp
123 Main Strett Suite 300,
Bigtown, MD 12345

Every row must be imported into the table so that the next row can see the 
data and consider it when assigning IDs to the name, company and address 
(all data must be normalised). How can this be done using set logic?

 You can also improve performance by ordering your checks so that the
 ones most likely to fail happen first.

Already done - I believe the problem is definitely in the navigational 
access model. What I am doing now makes perfect sense as far as the logic of 
the process goes - any other developer will read it and understand what is 
going on. At 3000 lines of code, this will be tedious, but understandable. 
But SQL hates it.

 Trying to achieve a high level of data quality in one large project is
 not often possible. Focus on the most critical areas of checking and get
 that working first with acceptable performance, then layer on additional
 checks while tuning. The complexity of the load programs you have also
 means they are susceptible to introducing data quality problems rather
 than removing them, so an incremental approach will also aid debugging
 of the load suite.

I couldn't agree more.

Carlo





Re: [PERFORM] Performace Optimization for Dummies

2006-10-02 Thread Carlo Stonebanks
 1. fork your import somehow to get all 4 cores running

This is already happening, albeit with only 3. No improvement - it appears we 
have taken the same problem and divided it by 3. Same projected completion 
time. This is really curious, to say the least.

 2. write the code that actually does the insert in C and use the
 parameterized prepared statement.

I had already tried the parameterised prepared statement; I had mentioned that 
I was surprised that it had no effect. No one here seemed surprised, or at 
least didn't think of commenting on it.

 however, your general approach has been 'please give me advice, but
 only the advice that i want'.

I'm sorry, I don't understand - I had actually originally come with four 
questions asking for recommendations and opinions on hardware, O/S and 
commercial support. I did also ask for comments on my config setup.





Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Heikki Linnakangas

Carlo Stonebanks wrote:

 We urgently need a major performance improvement. We are running the
 PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
 subsystem. Sorry about the long intro, but here are my questions:

Others have already drilled down to the way you do the inserts and 
statistics etc., so I'll just point out:

Are you fully utilizing all the 4 cores you have? Could you parallelize 
the loading process, if you're currently running just one client? Are 
you I/O bound or CPU bound?


--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com



Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Simon Riggs
On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote:

 At this early stage in the project, we are initializing our portal's 
 database with millions of rows of imported data in over 50 different 
 flattened tables; each table's structure is unique to the data provider. 
 This requires a pretty complex import program, because the data must be 
 matched semantically, not literally. Even with all of the expression 
 matching and fuzzy logic in the code, our performance statistics show that 
 the program spends over 75% of its time in SQL queries looking for matching 
 and/or duplicate data.

My experience with that type of load process is that doing this
row-by-row is a very expensive approach and your results bear that out.

It is often better to write each step as an SQL statement that operates
on a set of rows at one time. The lookup operations become merge joins
rather than individual SQL Selects via an index, so increase the
efficiency of the lookup process by using bulk optimisations and
completely avoiding any program/server call traffic. Data can move from
step to step by using Insert Selects into temporary tables, as Jim has
already suggested.
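
As a very rough sketch of what one such set-based step could look like (all
table, column and sequence names are invented):

-- stage the whole batch, then resolve matches for every row in one join
create temporary table stage as
select raw_id, lower(name) as name_key, zip
  from raw_import;

create index stage_key_idx on stage (name_key, zip);
analyze stage;

-- rows that already exist: record the match in bulk
insert into resolved (raw_id, person_id)
select s.raw_id, p.person_id
  from stage s
  join people p on p.name_key = s.name_key and p.zip = s.zip;

-- rows with no match: create new entries, again as one statement
-- (duplicates within the batch itself would still need a distinct step)
insert into people (person_id, name_key, zip)
select nextval('people_id_seq'), s.name_key, s.zip
  from stage s
  left join people p on p.name_key = s.name_key and p.zip = s.zip
 where p.person_id is null;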

The SQL set approach is different to the idea of simply moving the code
server-side by dropping it in a function. That helps with the net
traffic but has other issues also. You don't need to use esoteric
in-memory thingies if you use the more efficient join types already
available when you do set based operations (i.e. join all rows at once
in one big SQL statement).

You can also improve performance by ordering your checks so that the
ones most likely to fail happen first.

Trying to achieve a high level of data quality in one large project is
not often possible. Focus on the most critical areas of checking and get
that working first with acceptable performance, then layer on additional
checks while tuning. The complexity of the load programs you have also
means they are susceptible to introducing data quality problems rather
than removing them, so an incremental approach will also aid debugging
of the load suite. Dynamic SQL programs are particularly susceptible to
this kind of bug because you can't eyeball the code.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Markus Schaber
Hi, Carlo,

Carlo Stonebanks wrote:

 We urgently need a major performance improvement.

Did you think about putting the whole data into PostgreSQL using COPY in
a nearly unprocessed manner, index it properly, and then use SQL and
stored functions to transform the data inside the database to the
desired result?

We're using this way for some 3rd-party databases we have to process
in-house.
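
For example, something along these lines (file name, table and column names
invented; the data goes in almost untouched and is only indexed afterwards):

-- load the provider file nearly unprocessed
create table raw_provider_a (name text, company text, address text, zip text);

copy raw_provider_a from '/tmp/provider_a.tab' with delimiter E'\t' null '';

-- index whatever the transformation queries will search on, then analyze
create index raw_provider_a_name_idx on raw_provider_a (lower(name));
analyze raw_provider_a;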

HTH,
Markus

-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org



Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Markus Schaber
Hi, Carlo,

Carlo Stonebanks wrote:

 From what I can see, autovacuum is hitting the db's in question about once 
 every five minutes. Does this imply an ANALYZE is being done automatically 
 that would meet the requirements we are talking about here? Is there any 
 benefit to explicitly performing an ANALYZE?

Autovacuum looks at the modification statistics (they count how many
modifications have happened on the table), and decides whether it's time to
VACUUM (reclaim empty space) and ANALYZE (update column value
distributions) the table.

The exact thresholds for Autovacuum to kick in are configurable, see the
docs.
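
The relevant postgresql.conf settings look roughly like this (the values shown
are close to the 8.1 defaults):

autovacuum = on
autovacuum_naptime = 60                 # seconds between checks of each database
autovacuum_vacuum_threshold = 1000      # min. updated/deleted tuples before VACUUM
autovacuum_vacuum_scale_factor = 0.4    # ... plus this fraction of the table size
autovacuum_analyze_threshold = 500      # min. changed tuples before ANALYZE
autovacuum_analyze_scale_factor = 0.2   # ... plus this fraction of the table size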

HTH,
Markus


-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org



Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Bill Moran
In response to Carlo Stonebanks [EMAIL PROTECTED]:

  indexes. I don't know whether autovacuum will also analyze tables
  for you automagically, but it would be a good idea to analyze the table
 
  It does.
 
 So, I have checked my log and I see an autovacuum running once every minute 
 on our various databases being hosted on the server - once every minute!
 
 From what I can see, autovacuum is hitting the db's in question about once 
 every five minutes. Does this imply an ANALYZE is being done automatically 
 that would meet the requirements we are talking about here? Is there any 
 benefit to explicitly performing an ANALYZE? 
 
 (Or does this go hand-in-and with turning off autovacuum...?)

It's only checking every 5 minutes to see if vacuum/analyze needs to be done.
It may or may not do any actual work at that time, based on how much
the tables have changed.  See:
http://www.postgresql.org/docs/8.1/interactive/maintenance.html#AUTOVACUUM

This is a case, during your bulk loads, where autovacuum might actually
hurt you.  How many records are you inserting/updating in 5 minutes?
You may be exceeding autovacuum's ability to keep things clean.

I can't say for sure, but I would suspect that you'd be better off not
using autovacuum until after the initial data loads are done.  My
guess is that you'll get better performance if you disable autovac and
write manual vacuum/analyze into your load scripts.  Exactly how often
to have your script do it is something that will require testing to
figure out, but probably start with every 100 or so, then adjust
it up and down and see what works best.
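
In the load script that could be as simple as something like this, issued
every few batches (table names invented, the interval needs testing):

-- reclaim dead rows and refresh statistics on the tables the import hits hardest
vacuum analyze people;
vacuum analyze companies;
analyze addresses;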

Explicitly performing a vacuum or analyze can be very beneficial,
especially if you know what kind of changes you're creating in the data.
(Now that I think of it, there's no reason to disable autovac, as it
will notice if you've just manually vacuumed a table and not do it
again.)  If you know that you're radically changing the kind of data
in a table, manually running analyze is a good idea.  If you know that
you're creating a lot of dead tuples, manually vacuuming is a good
idea.  Especially during a big data load where these changes might be
taking place faster than autovac notices.

-- 
Bill Moran
Collaborative Fusion Inc.



Re: [PERFORM] Performace Optimization for Dummies

2006-09-29 Thread Merlin Moncure

On 9/29/06, Carlo Stonebanks [EMAIL PROTECTED] wrote:

 For reasons I've explained elsewhere, the import process is not well suited
 to breaking up the data into smaller segments. However, I'm interested in
 what can be indexed. I am used to the idea that indexing only applies to
 expressions that allow the data to be sorted, and then binary searches can
 be performed on the sorted list. For example, I can see how you can create
 an index to support:

 where foo like 'bar%'

 But is there any way to create an index expression that will help with:

 where foo like '%bar%'?

 I don't see it - but then again, I'm ready to be surprised!


using a standard (btree) index, you can create an index on any constant
expression.  so, you can create an index that matches '%bar%', but if
you also want to match '%bat%', you need another index.  there are
other exotic methods like the tsearch and gist approaches which may or may
not be suitable.
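
for instance (names invented - note the partial index only ever helps the one
pattern it names):

-- a partial index for one specific substring pattern
create index t_bar_idx on t (id) where foo like '%bar%';

-- left-anchored patterns can use a functional btree index instead
-- (text_pattern_ops makes it usable for like in non-C locales)
create index t_foo_idx on t (lower(foo) text_pattern_ops);
select id from t where lower(foo) like 'bar%';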

regarding your import process, you came to this list and asked for
advice on how to fix your particular problem.  tweaking
postgresql.conf, etc will get you incremental gains but are unlikely
to have a huge impact.  as i understand it, your best shot at
improvement using current process is to:
1. fork your import somehow to get all 4 cores running
2. write the code that actually does the insert in C and use the
parameterized prepared statement.
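
at the sql level the prepared statement itself would look something like this
(table and column names invented); the C client would prepare it once and then
just bind the parameters for every row:

prepare ins_person (text, text, text) as
insert into people (name, company, address) values ($1, $2, $3);

execute ins_person('John Q Smith', 'Foobar Corp', '123 Main St');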

however, your general approach has been 'please give me advice, but
only the advice that i want'.  if you really want to fix your problem,
give more specific details about your import and open the door to
improvements in your methodology which i suspect is not optimal.  you
concluded that client side coding was the way to go, but here you are
asking how to make it work.  if you want help (and there are some
extremely smart people here who may give you world class advice for
free), you need to lay your cards on the table and be willing to
consider alternative solutions.   you may find that a few properly
constructed queries will knock out 75% of your code and running time.

remember, often the key to optimization is choosing the right algorithm

merlin



[PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
I am a software developer who is acting in a (temporary) dba role for a 
project. I had recommended PostgreSQL be brought in to replace the proposed 
MySQL DB - I chose PostgreSQL because of its reputation as a more stable 
solution than MySQL.

At this early stage in the project, we are initializing our portal's 
database with millions of rows of imported data in over 50 different 
flattened tables; each table's structure is unique to the data provider. 
This requires a pretty complex import program, because the data must be 
matched semantically, not literally. Even with all of the expression 
matching and fuzzy logic in the code, our performance statistics show that 
the program spends over 75% of its time in SQL queries looking for matching 
and/or duplicate data.

The import is slow - and degrades as the tables grow. With even more 
millions of rows in dozens of import tables to come, the imports will take 
forever. My ability to analyse the queries is limited; because of the nature 
of the import process, the SQL queries are mutable, every imported row can 
change the structure of a SQL query as the program adds and subtracts search 
conditions to the SQL command text before execution. The import program is 
scripted in Tcl. An attempt to convert our queries to prepared queries 
(curiously) did not bring any performance improvements, and we converted back 
to simplify the code.

We urgently need a major performance improvement. We are running the 
PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core 
3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc 
subsystem. Sorry about the long intro, but here are my questions:

1) Are we paying any big penalties by running Windows vs LINUX (or any other 
OS)?

2) Has the debate over PostgreSQL and Xeon processors been settled? Is this 
a factor?

3) Are there any easy-to-use performance analysis/optimisation tools that we 
can use? I am dreaming of one that could point out problems and suggest 
and/or effect solutions.

4) Can anyone recommend any commercial PostgreSQL service providers that may 
be able to swiftly come in and assist us with our performance issues?

Below, please find what I believe are the configuration settings of interest 
in our system

Any help and advice will be much appreciated. TIA,

Carlo

max_connections = 100
shared_buffers = 5
work_mem = 32768
maintenance_work_mem = 32768
checkpoint_segments = 128
effective_cache_size = 1
random_page_cost = 3
stats_start_collector = on
stats_command_string = on
stats_row_level = on
autovacuum = on










Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Joshua D. Drake

 The import is slow - and degrades as the tables grow. With even more 
 millions of rows in dozens of import tables to come, the imports will take 
 forever. My ability to analyse the queries is limited; because of the nature 
 of the import process, the SQL queries are mutable, every imported row can 
 change the structure of a SQL query as the program adds and subtracts search 
 conditions to the SQL command text before execution. The import program is 
 scripted in Tcl. An attempt to convert our queries to prepared queries 
 (curiously) did not bring any performance improvements, and we converted back 
 to simplify the code.

How are you loading the tables? Copy? Insert?

 
 We urgently need a major performance improvement. We are running the 
 PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core 
 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc 
 subsystem. Sorry about the long intro, but here are my questions:
 
 1) Are we paying any big penalties by running Windows vs LINUX (or any other 
 OS)?

Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

 
 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this 
 a factor?

Depends. PostgreSQL is much better with the Xeon in general, but are you
running woodcrest based CPUs or the older models?

 
 3) Are there any easy-to-use performance analysis/optimisation tools that we 
 can use? I am dreaming of one that could point out problems and suggest 
 and/or effect solutions.

I don't know about Windows, but *nix has a number of tools available
directly at the operating system level to help you determine various
bottlenecks.

 
 4) Can anyone recommend any commercial PostgreSQL service providers that may 
 be able to swiftly come in and assist us with our performance issues?

http://www.commandprompt.com/ (disclaimer, I am an employee)

 
 Below, please find what I believe are the configuration settings of interest 
 in our system
 
 Any help and advice will be much appreciated. TIA,
 
 Carlo
 
 max_connections = 100
 shared_buffers = 5

This could probably be higher.

 work_mem = 32768

Depending on what you are doing, this could be too low or too high.

 maintenance_work_mem = 32768
 checkpoint_segments = 128
 effective_cache_size = 1

This could probably be higher.

 random_page_cost = 3
 stats_start_collector = on
 stats_command_string = on
 stats_row_level = on
 autovacuum = on

Stats are a hit... you need to determine if you actually need them.

Joshua D. Drake



 
 
 
 
 
 
 
 


-- 

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/





Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Merlin Moncure

On 9/28/06, Carlo Stonebanks [EMAIL PROTECTED] wrote:

 We urgently need a major performance improvement. We are running the
 PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
 subsystem. Sorry about the long intro, but here are my questions:


are you using the 'copy' interface?


 1) Are we paying any big penalties by running Windows vs LINUX (or any other
 OS)?


that's a tough question.  my gut says that windows will not scale as
well as recent linux kernels in high load environments.


 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
 a factor?


hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
down) is outclassed by the opteron.


 Below, please find what I believe are the configuration settings of interest
 in our system


1. can probably run fsync=off during the import
2. if import is single process, consider temporary bump to memory for
index creation. or, since you have four cores consider having four
processes import the data somehow.
3. turn off stats collector, stats_command_string, stats_row_level,
and autovacuum during import.
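
taken together, a possible set of load-time overrides in postgresql.conf might
be (revert them once the import is done - fsync = off risks losing the database
on a crash):

fsync = off                      # only while the data can be reloaded from source files
stats_start_collector = off
stats_command_string = off
stats_row_level = off
autovacuum = off
maintenance_work_mem = 262144    # in kB; a temporary bump for index builds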

merlin


 Any help and advice will be much appreciated. TIA,

 Carlo

 max_connections = 100
 shared_buffers = 5
 work_mem = 32768
 maintenance_work_mem = 32768
 checkpoint_segments = 128
 effective_cache_size = 1
 random_page_cost = 3
 stats_start_collector = on
 stats_command_string = on
 stats_row_level = on
 autovacuum = on













Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 How are you loading the tables? Copy? Insert?

Once the data is transformed, it is inserted. I don't have stats, but the 
programs visual feedback does not spend a lot of time on the inserting 
data message. Then again, if there is an asynchronous component to an 
insert, perhaps I am not seeing how slow an insert really is until I query 
the table.

 Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

Don't suppose you'd care to hazard a guess on what sort of scale we're 
talking about? Are we talking about 10%? 100%? I know this is a hard one to 
judge. My impression was that the *NIX improvements were with concurrent 
use, and right now I am obsessing over this single-threaded issue.

 Depends. PostgreSQL is much better with the Xeon in general, but are you
 running woodcrest based CPUs or the older models?

Weren't those released in July? This server is a few months older, so I 
guess not. But maybe? Does Dell have the ability to install CPUs from the 
future like Cyberdyne does? ;-)

 I don't know about Windows, but *nix has a number of tools available
 directly at the operating system level to help you determine various
 bottlenecks.

Are we talking about I/O operations? I was thinking of SQL query analysis. 
The stuff I read here about query analysis is pretty intriguing, but to 
someone unfamiliar with this type of query analysis it all looks quite 
uncertain to me. I mean, I read the threads and it all looks like a lot of 
trying to figure out how to cajole PostgreSQL to do what you want, rather 
than telling it: HEY I CREATED THAT INDEX FOR A REASON, USE IT!

I know this may be non-dba sophistication on my part, but I would like a 
tool that would make this whole process a little less arcane. I'm not the 
Gandalf type.

 4) Can anyone recommend any commercial PostgreSQL service providers that 
 may
 be able to swiftly come in and assist us with our performance issues?

 http://www.commandprompt.com/ (disclaimer, I am an employee)

Very much appreciated.

 max_connections = 100
 shared_buffers = 5

 This could probably be higher.

Ok, good start...


 work_mem = 32768

 Depending on what you are doing, this is could be to low or to high.

Is this like "You could be too fat or too thin"? Aren't you impressed with 
the fact that I managed to pick the one number that was not right for 
anything?


 maintenance_work_mem = 32768
 checkpoint_segments = 128
 effective_cache_size = 1

 This coudl probably be higher.

... noted...


 random_page_cost = 3
 stats_start_collector = on
 stats_command_string = on
 stats_row_level = on
 autovacuum = on

 Stats are a hit... you need to determine if you actually need them.

Unfortunately, this is the only way I know of getting the query string to 
appear in the PostgreSQL server status display. While trying to figure out 
what is slowing things down, having that is really helpful. I also imagined 
that this sort of thing would be a performance hit when you are getting lots 
of small, concurrent queries. In my case, we have queries which are taking 
around a second to perform outer joins. They aren't competing with any other 
requests as the site is not running, we are just running one app to seed the 
data.





Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 are you using the 'copy' interface?

Straightforward inserts - the import data has to be transformed, normalised and 
de-duped by the import program. I imagine the copy interface is for more 
straightforward data importing. These are - by necessity - single row 
inserts.

 that's a tough question.  my gut says that windows will not scale as
 well as recent linux kernels in high load environments.

But not in the case of a single import program trying to seed a massive 
database?

 hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
 down) is outclassed by the opteron.

Need to find a way to determine the Xeon type. The server was bought in 
early 2006, and it looks like Woodcrest was from July.

 1. can probably run fsync=off during the import
 2. if import is single proecess, consider temporary bump to memory for
 index creation. or, since you have four cores consider having four
 processes import the data somehow.
 3. turn off stats collector, stats_command_string, stats_row_level,
 and autovacuum during import.

Very helpful, thanks.

Carlo 





Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Dave Dutcher
 -Original Message-
 From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Carlo
Stonebanks
 Subject: [PERFORM] Performace Optimization for Dummies
 
 At this early stage in the project, we are initializing our portal's 
 database with millions of rows of imported data in over 50 different 
 flattened tables; each table's structure is unique to the 
 data provider. 
 This requires a pretty complex import program, because the 
 data must be 
 matched semantically, not literally. Even with all of the expression 
 matching and fuzzy logic in the code, our performance 
 statistics show that 
 the program spends over 75% of its time in SQL queries 
 looking for matching 
 and/or duplicate data.
 
 The import is slow - and degrades as the tables grow. 

So your program first transforms the data and then inserts it?  And it is
the transforming process which is running select statements that is slow?
If that is the case you could use duration logging to find the slow select
statement, and then you could post an EXPLAIN ANALYZE of the select.  
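
For example, something like this in postgresql.conf logs every statement slower
than a quarter of a second (the threshold is arbitrary), and the offender can
then be re-run under EXPLAIN ANALYZE:

log_min_duration_statement = 250      # milliseconds

explain analyze
select p.person_id
  from people p
 where lower(p.name) like 'smith%';    -- table, column and pattern invented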

One question off the top of my head: are you using regular expressions for
your fuzzy logic? If so, do your indexes have the right operator classes?
(see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

Dave




Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Jim C. Nasby
On Thu, Sep 28, 2006 at 10:11:31AM -0700, Joshua D. Drake wrote:
  4) Can anyone recommend any commercial PostgreSQL service providers that 
  may 
  be able to swiftly come in and assist us with our performance issues?
 
 http://www.commandprompt.com/ (disclaimer, I am an employee)
 
You forgot us. :)
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Jim C. Nasby
On Thu, Sep 28, 2006 at 01:47:44PM -0400, Carlo Stonebanks wrote:
  How are you loading the tables? Copy? Insert?
 
 Once the data is transformed, it is inserted. I don't have stats, but the 
 program's visual feedback does not spend a lot of time on the "inserting 
 data" message. Then again, if there is an asynchronous component to an 
 insert, perhaps I am not seeing how slow an insert really is until I query 
 the table.
 
Well, individual inserts are slow, especially if they're not wrapped up
in a transaction. And you also mentioned checking for dupes. I suspect
that you're not going to find any huge gains in tuning the database...
it sounds like the application (as in: how it's using the database) is
what needs help.

  work_mem = 32768
 
  Depending on what you are doing, this could be too low or too high.
 
 Is this like "You could be too fat or too thin"? Aren't you impressed with 
 the fact that I managed to pick the one number that was not right for 
 anything?

For what you're doing, it's probably fine where it is... but while
you're in the single-thread case, you can safely make that pretty big
(like 100).

 
  maintenance_work_mem = 32768
  checkpoint_segments = 128
  effective_cache_size = 1
 
  This could probably be higher.

I'd suggest setting it to about 3G, or 375000.
 
  random_page_cost = 3
  stats_start_collector = on
  stats_command_string = on
  stats_row_level = on
  autovacuum = on
 
  Stats are a hit... you need to determine if you actually need them.
 
 Unfortunately, this is the only way I know of getting the query string to 
 appear in the PostgreSQL server status display. While trying to figure out 
 what is slowing things down, having that is really helpful. I also imagined 
 that this sort of thing would be a performance hit when you are getting lots 
 of small, concurrent queries. In my case, we have queries which are taking 
 around a second to perform outer joins. They aren't competing with any other 
 requests as the site is not running, we are just running one app to seed the 
 data.

stats_command_string can exact a huge penalty pre-8.2, on the order of
30%. I'd turn it off unless you *really* need it. Command logging (ie:
log_min_duration_statement) is much less of a burden.

The fact that you're doing outer joins while loading data really makes
me suspect that the application needs to be changed for any real
benefits to be had. But you should still look at what EXPLAIN ANALYZE is
showing you on those queries; you might be able to find some gains
there.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Jim C. Nasby
On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
  are you using the 'copy' interface?
 
 Straightforward inserts - the import data has to be transformed, normalised and 
 de-duped by the import program. I imagine the copy interface is for more 
 straightforward data importing. These are - by necessity - single row 
 inserts.

BTW, stuff like de-duping is something you really want the database -
not an external program - to be doing. Think about loading the data into
a temporary table and then working on it from there.
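
A skeleton of that approach might look like this (table and column names
invented):

-- load the raw batch first, then let the database de-dupe in bulk
create temporary table incoming as
select * from raw_import;

-- drop rows that already exist
delete from incoming
 using people
 where lower(incoming.name) = lower(people.name)
   and incoming.zip = people.zip;

-- whatever survives is inserted in one statement
insert into people (name, company, zip)
select name, company, zip from incoming;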
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Matthew Nuzum
Carlo Stonebanks wrote:
 are you using the 'copy' interface?
 
 Straightforward inserts - the import data has to be transformed, normalised and 
 de-duped by the import program. I imagine the copy interface is for more 
 straightforward data importing. These are - by necessity - single row 
 inserts.
 

I know this is an answer to a question you didn't ask, but here it is. I
was once doing stuff where I processed log files and had to do many
lookups to normalize the data before insertion.

I started out doing everything in SQL and using postgresql tables and it
took a little over 24 hours to process 24 hours worth of data. Like you,
it was single process, many lookups.

I found a better way. I rewrote it (once in c#, again in python) and
used hashtables/dictionaries instead of tables for the lookup data. For
example, I'd start by loading the data into hash tables (yes, this took
a *lot* of ram) then for each row I did something like:
 1. Is it in the hash table?
    - If not, insert it into the db
    - Insert it into the hashtable
 2. Get the lookup field out of the hash table
 3. Output normalized data

This allowed me to create text files containing the data in COPY format
which can then be inserted into the database at dramatically increased
speeds.

My first version in C# (mono) cut the time down to 6 hours for 24 hours
worth of data. I tweaked the algorithms and rewrote it in Python and got
it down to 45 min. (Python can't take all the credit for the performance
boost, I used an improved technique that could have been done in C# as
well) This time included the time needed to do the copy and update the
indexes.

I created a version that also used gdb databases instead of hash tables.
It increased the time from 45 min to a little over an hour but decreased
the memory usage to something like 45MB (vs dozens or hundreds of MB per
hashtable)
-- 
Matthew Nuzum
newz2000 on freenode



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks

 Are you wrapping all this in a transaction?

Yes, the transactions can typically wrap 1 to 10 single-table, single-row 
inserts and updates.


 You're doing some dynamically generated selects as part of the
 de-duping process? They're probably the expensive bit. What
 do those queries tend to look like?

Without a doubt, this is the expensive bit.

 Are you analysing the table periodically? If not, then you might
 have statistics based on an empty table, or default statistics, which
 might cause the planner to choose bad plans for those selects.

Now there's something I didn't know - I thought that analysis and planning 
was done with every select, and the performance benefit of prepared 
statements was to plan once, execute many. I can easily put in a periodic 
analyse statement. I obviously missed how to use analyze properly; I 
thought it was just human-readable output - do I understand correctly that 
it can be used to get the SQL server to re-evaluate its plan based on newer 
statistics - even on non-prepared queries?

 Talking of which, are there indexes on the table? Normally you
 wouldn't have indexes in place during a bulk import, but if you're
 doing selects as part of the data load process then you'd be forcing
 sequential scans for every query, which would explain why it gets
 slower as the table gets bigger.

There are indexes for every obvious "where this = that" clause. I don't 
believe that they will work for ilike expressions.


 Cheers,
   Steve







Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 So your program first transforms the data and then inserts it?  And it is
 the transforming process which is running select statements that is slow?

There are cross-referencing and deduplication processes. Does this person 
have an office at this exact address? In a similarly named building in the 
same zip code? City? What is the similarity of the building or enterprise 
names? Is there a person with a similar name with the same type of 
professional license nearby? We basically look for the statistical 
likelihood that they already exist to decide whether to update their data, 
or insert a new data element.

These are all extremely soft queries and require left outer joins with all 
of the related tables that would contain this data (the left outer join 
tells us whether the related element satisfied the search condition). As I 
mentioned, as the data comes in, we examine what we have to work with and 
modify the tables and columns we can check - which is what I meant by "the 
SQL queries are mutable, every imported row can change the structure of a 
SQL query as the program adds and subtracts search conditions to the SQL 
command text before execution".

 If that is the case you could use duration logging to find the slow select
 statement, and then you could post an EXPLAIN ANALYZE of the select.

I'm pretty sure I know who the culprit is, and - like I said, it comes from 
a section of code that creates a mutable statement. However, everyone is 
being so helpful and I should post this data as soon as I can.

 One question off the top of my head is are you using regular expressions 
 for
 your fuzzy logic if so do your indexes have the right operator classes?
 (see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

I am using regular expressions and fuzzy logic, but mostly on the client 
side (I have a Tcl implementation of levenshtein, for example). I don't 
think you can use indexes on functions such as levenshtein, because it 
requires a parameter only available at execution time. The link you sent me 
was very interesting - I will definitely reconsider my query logic if I can 
optimise regular expression searches on the server. Thanks!

Carlo 





Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Merlin Moncure

On 9/28/06, Carlo Stonebanks [EMAIL PROTECTED] wrote:

 are you using the 'copy' interface?

Straightforward inserts - the import data has to be transformed, normalised and
de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - by necessity - single row
inserts.


right. see comments below.


 that's a tough question.  my gut says that windows will not scale as
 well as recent linux kernels in high load environments.

But not in the case of a single import program trying to seed a massive
database?


probably not.


 hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
 down) is outclassed by the opteron.

Need to find a way to determine the Xeon type. The server was bought in
early 2006, and it looks like Woodcrest was from July.


ok, there are better chips out there but again this is not something
you would really notice outside of high load environments.


 1. can probably run fsync=off during the import
 2. if import is single proecess, consider temporary bump to memory for
 index creation. or, since you have four cores consider having four
 processes import the data somehow.
 3. turn off stats collector, stats_command_string, stats_row_level,
 and autovacuum during import.


by the way, stats_command_string is a known performance killer that
iirc was improved in 8.2. just fyi.

I would suggest at least consideration of retooling your import as
follows...it might be a fun project to learn some postgresql
internals.  I'm assuming you are doing some script preprocessing in a
language like perl:

bulk load the denormalized tables into scratch tables in the postgresql
database. create indexes appropriate to the normalization process,
remembering you can index on virtually any expression in postgresql
(including regex substitution).

use sql to process the data. if tables are too large to handle with
monolithic queries, use cursors and/or functions to handle the
conversion.  now you can keep track of progress using pl/pgsql raise
notice command for example.
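
for example, an expression index over a regex-normalised key on the scratch
table (names invented) lets the later matching queries use an index instead of
a sequential scan:

create index scratch_addr_key_idx
    on scratch (regexp_replace(lower(address), '[^a-z0-9]', '', 'g'));
analyze scratch;

-- the matching query must use the same expression to hit the index
select s.raw_id, a.address_id
  from scratch s
  join addresses a
    on regexp_replace(lower(s.address), '[^a-z0-9]', '', 'g')
     = regexp_replace(lower(a.address), '[^a-z0-9]', '', 'g');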

merlin



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
The deduplication process requires so many programmed procedures that it 
runs on the client. Most of the de-dupe lookups are not straight lookups, 
but calculated ones employing fuzzy logic. This is because we cannot dictate 
the format of our input data and must deduplicate with what we get.

This was one of the reasons why I went with PostgreSQL in the first place, 
because of the server-side programming options. However, I saw incredible 
performance hits when running processes on the server and I partially 
abandoned the idea (some custom-built name-comparison functions still run 
on the server).

I am using Tcl on both the server and the client. I'm not a fan of Tcl, but 
it appears to be quite well implemented and feature-rich in PostgreSQL. I 
find PL/pgsql awkward - even compared to Tcl. (After all, I'm just a 
programmer...  we do tend to be a little limited.)

The import program actually runs on the server box as a db client and 
involves about 3000 lines of code (and it will certainly grow steadily as we 
add compatibility with more import formats). Could a process involving that 
much logic run on the db server, and would there really be a benefit?

Carlo


Jim C. Nasby [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
  are you using the 'copy' interface?

 Straightforward inserts - the import data has to be transformed, normalised 
 and
 de-duped by the import program. I imagine the copy interface is for more
 straightforward data importing. These are - by necessity - single row
 inserts.

 BTW, stuff like de-duping is something you really want the database -
 not an external program - to be doing. Think about loading the data into
 a temporary table and then working on it from there.
 -- 
 Jim Nasby[EMAIL PROTECTED]
 EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)






Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Merlin Moncure

On 9/28/06, Carlo Stonebanks [EMAIL PROTECTED] wrote:

The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not straight lookups,
but calculated ones employing fuzzy logic. This is because we cannot dictate
the format of our input data and must deduplicate with what we get.

This was one of the reasons why I went with PostgreSQL in the first place,
because of the server-side programming options. However, I saw incredible
performance hits when running processes on the server and I partially
abandoned the idea (some custom-built name-comparison functions still run
on the server).


imo, the key to high performance big data movements in postgresql is
mastering sql and pl/pgsql, especially the latter.  once you get good
at it, your net time of copy+plpgsql is going to be less than
insert+tcl.

merlin



Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
Lots of great info here, I will see what applies to my situation. However, I 
don't see bulk inserts of the tables working, because all of the tables need 
to be refreshed as values to deduplicate and match will change with every 
row added. In order for this to work, I would have to write queries against 
the hash tables. This is where something like MySQL's in-memory tables would 
have come in handy...

What is GDB?

Carlo

Matthew Nuzum [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Carlo Stonebanks wrote:
 are you using the 'copy' interface?

 Straightforward inserts - the import data has to be transformed, normalised 
 and
 de-duped by the import program. I imagine the copy interface is for more
 straightforward data importing. These are - by necessity - single row
 inserts.


 I know this is an answer to a question you didn't ask, but here it is. I
 was once doing stuff where I processed log files and had to do many
 lookups to normalize the data before insertion.

 I started out doing everything in SQL and using postgresql tables and it
 took a little over 24 hours to process 24 hours worth of data. Like you,
 it was single process, many lookups.

 I found a better way. I rewrote it (once in c#, again in python) and
 used hashtables/dictionaries instead of tables for the lookup data. For
 example, I'd start by loading the data into hash tables (yes, this took
 a *lot* of ram) then for each row I did something like:
  1. Is it in the hash table?
     - If not, insert it into the db
     - Insert it into the hashtable
  2. Get the lookup field out of the hash table
  3. Output normalized data

 This allowed me to create text files containing the data in COPY format
 which can then be inserted into the database at dramatically increased
 speeds.

 My first version in C# (mono) cut the time down to 6 hours for 24 hours
 worth of data. I tweaked the algorithms and rewrote it in Python and got
 it down to 45 min. (Python can't take all the credit for the performance
 boost, I used an improved technique that could have been done in C# as
 well) This time included the time needed to do the copy and update the
 indexes.

 I created a version that also used gdb databases instead of hash tables.
 It increased the time from 45 min to a little over an hour but decreased
 the memory usage to something like 45MB (vs dozens or hundreds of MB per
 hashtable)
 -- 
 Matthew Nuzum
 newz2000 on freenode






Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Steve Atkins


On Sep 28, 2006, at 12:10 PM, Carlo Stonebanks wrote:




  Are you wrapping all this in a transaction?

 Yes, the transactions can typically wrap 1 to 10 single-table, single-row
 inserts and updates.

  You're doing some dynamically generated selects as part of the
  de-duping process? They're probably the expensive bit. What
  do those queries tend to look like?

 Without a doubt, this is the expensive bit.

If you could give some samples of those queries here I suspect
people could be a lot more helpful with some optimisations, or
at least pinpoint where the performance issues are likely to be.

  Are you analysing the table periodically? If not, then you might
  have statistics based on an empty table, or default statistics, which
  might cause the planner to choose bad plans for those selects.

 Now there's something I didn't know - I thought that analysis and planning
 was done with every select, and the performance benefit of prepared
 statements was to plan once, execute many. I can easily put in a periodic
 analyse statement. I obviously missed how to use analyze properly; I
 thought it was just human-readable output - do I understand correctly that
 it can be used to get the SQL server to re-evaluate its plan based on newer
 statistics - even on non-prepared queries?

I think you're confusing explain and analyze. Explain gives you
human-readable output as to what the planner decided to do with the
query you give it.

Analyze samples the data in tables and stores the statistical distribution
of the data, and estimates of table size and that sort of thing, for the
planner to use to decide on a good query plan. You need to run
analyze when the statistics or size of a table have changed significantly,
so as to give the planner the best chance of choosing an appropriate
plan.

If you're not running analyze occasionally then the planner will be
working on default stats or empty-table stats and will tend to avoid
indexes. I don't know whether autovacuum will also analyze tables
for you automagically, but it would be a good idea to analyze the table
every so often, especially early on in the load - as the stats gathered for
a small table will likely give painful performance once the table has
grown a lot.

  Talking of which, are there indexes on the table? Normally you
  wouldn't have indexes in place during a bulk import, but if you're
  doing selects as part of the data load process then you'd be forcing
  sequential scans for every query, which would explain why it gets
  slower as the table gets bigger.

 There are indexes for every obvious "where this = that" clause. I don't
 believe that they will work for ilike expressions.

If you're doing a lot of where foo ilike 'bar%' queries, with the pattern
anchored to the left you might want to look at using a functional index
on lower(foo) and rewriting the query to look like where lower(foo) like
lower('bar%').

Similarly if you have many queries where the pattern is anchored
at the right of the string then a functional index on the reverse of the
string can be useful.
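
In other words, something like this (names invented; reverse() is not built in
here, so it would have to be a small user-defined function marked immutable):

create index t_foo_lower_idx on t (lower(foo) text_pattern_ops);
-- matches left-anchored searches: where lower(foo) like 'bar%'

create index t_foo_rev_idx on t (reverse(lower(foo)) text_pattern_ops);
-- a right-anchored pattern such as '%bar' becomes left-anchored after
-- reversing: where reverse(lower(foo)) like 'rab%'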

Cheers,
  Steve




Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Jim C. Nasby
On Thu, Sep 28, 2006 at 02:04:21PM -0700, Steve Atkins wrote:
 I think you're confusing explain and analyze. Explain gives you
 human readable output as to what the planner decided to do with the
 query you give it.
 
Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
nothing to do with the ANALYZE command.
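
For example (the query itself is arbitrary; the table name is hypothetical):

-- EXPLAIN ANALYZE actually runs the query and reports real row counts
-- and timings next to the planner's estimates
EXPLAIN ANALYZE SELECT * FROM entity WHERE postal_code = 'H3A 1B2';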

 indexes. I don't know whether autovacuum will also analyze tables
 for you automagically, but it would be a good idea to analyze the table

It does.

 Talking of which, are there indexes on the table? Normally you
 wouldn't have indexes in place during a bulk import, but if you're
 doing selects as part of the data load process then you'd be forcing
 sequential scans for every query, which would explain why it gets
 slower as the table gets bigger.
 
 There are indexes for every obvious "where this = that" clause. I don't
 believe that they will work for ilike expressions.
 
 If you're doing a lot of where foo ilike 'bar%' queries, with the pattern
 anchored to the left, you might want to look at using a functional index
 on lower(foo) and rewriting the query to look like where lower(foo) like
 lower('bar%').
 
 Similarly if you have many queries where the pattern is anchored
 at the right of the string then a functional index on the reverse of the
 string can be useful.

tsearch might prove helpful... I'm not sure how it handles substrings.

Something else to consider... databases love doing bulk operations. It
might be useful to load prospective data into a temporary table, and
then do as many operations as you can locally (ie: within the database)
on that table, hopefully eliminating as many candidate rows as possible
along the way.
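
A minimal sketch of that approach, with purely illustrative table and
column names:

-- load the raw batch into a scratch table inside the database
CREATE TEMP TABLE staging (last_name text, first_name text, postal_code text);
COPY staging FROM '/tmp/batch.csv' WITH CSV;
ANALYZE staging;

-- drop candidates that already exist in the target in one set operation,
-- before any row-by-row processing
DELETE FROM staging
 WHERE EXISTS (SELECT 1
                 FROM entity
                WHERE lower(entity.last_name)  = lower(staging.last_name)
                  AND lower(entity.first_name) = lower(staging.first_name)
                  AND entity.postal_code       = staging.postal_code);

-- whatever survives is genuinely new and can be inserted in bulk
INSERT INTO entity (last_name, first_name, postal_code)
SELECT DISTINCT last_name, first_name, postal_code
  FROM staging;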

I also suspect that running multiple merge processes at once would help.
Right now, your workload looks something like this:

client sends query  database is idle
client is idle  database runs query
client gets query back  database is idle

Oversimplification, but you get the point. There's a lot of time spent
waiting on each side. If the import code is running on the server, you
should probably run one import process per CPU. If it's on an external
server, 2 per CPU would probably be better (and that might be faster
than running local on the server at that point).
-- 
Jim Nasby    [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Matthew Nuzum
Carlo Stonebanks wrote:
 Lots of great info here, I will see what applies to my situation. However, I 
 don't see bulk inserts of the tables working, because all of the tables need 
 to be refreshed as values to deduplicate and match will change with every 
 row added. In order for this to work, i would have to write queries against 
 the hash tables. This is where something like MySQL's in-memory tables would 
 have come in handy...
 
 What is GDB?
 
 Carlo

Sorry, meant GDBM (disk based hash/lookup table).

With Postgres if your tables fit into RAM then they are in-memory as
long as they're actively being used.

Hashtables and GDBM, as far as I know, are only useful for key-value
lookups. However, for this they are *fast*. If you can figure out a way
to make them work I'll bet things speed up.
-- 
Matthew Nuzum
newz2000 on freenode

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 Something else to consider... databases love doing bulk operations. It
 might be useful to load prospective data into a temporary table, and
 then do as many operations as you can locally (ie: within the database)
 on that table, hopefully eliminating as many candidate rows as possible
 along the way.

I wish this would work... it was definitely something I considered early on 
in the project. Even thinking of explaining why it won't work is giving me a 
headache...

But I sure wish it would. 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 imo, the key to high performance big data movements in postgresql is
 mastering sql and pl/pgsql, especially the latter.  once you get good
 at it, your net time of copy+plpgsql is going to be less than
 insert+tcl.

If this implies bulk inserts, I'm afraid I have to consider something else.
Any data that has been imported and deduplicated has to be placed back into
the database so that it can be available for the next imported row (there
are currently 16 tables affected, and more to come). If I were to cache all
inserts into a separate resource, then I would have to search 32 tables -
the local pending resources, as well as the data still in the system. I am
not even mentioning that imports do not just insert rows, they can also
update rows, adding their own complexity.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 by the way, stats_command_string is a known performance killer that
 iirc was improved in 8.2. just fyi.

This is a handy fact, I will get on this right away.
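
For reference, it is set in postgresql.conf (shown here as it would
appear in an 8.1-era configuration file):

# stats_command_string reports each backend's current query to the stats
# collector; leaving it off avoids that per-query overhead (pre-8.2)
stats_command_string = off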

 bulk load denormalized tables into scratch tables into the postgresql
 database. create indexes appropriate to the normalization process
 remembering you can index on virtually any expression in postgresql
 (including regex substitution).

 use sql to process the data. if tables are too large to handle with
 monolithic queries, use cursors and/or functions to handle the
 conversion.  now you can keep track of progress using pl/pgsql raise
 notice command for example.
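
For illustration, an expression index of the kind described above might
look like this (table and column names are hypothetical):

-- index scratch rows on their phone number with all non-digits stripped,
-- so lookups can match on the normalized form directly
CREATE INDEX scratch_phone_digits_idx
    ON scratch ((regexp_replace(phone, '[^0-9]', '', 'g')));

SELECT * FROM scratch
 WHERE regexp_replace(phone, '[^0-9]', '', 'g') = '5145551234';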

For reasons I've explained elsewhere, the import process is not well suited
to breaking up the data into smaller segments. However, I'm interested in
what can be indexed. I am used to the idea that indexing only applies to
expressions that allow the data to be sorted, and then binary searches can
be performed on the sorted list. For example, I can see how you can create
an index to support:

where foo like 'bar%'

But is there any way to create an index expression that will help with:

where foo like '%bar%'?

I don't see it - but then again, I'm ready to be surprised!

Carlo 



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Carlo Stonebanks
 Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
 nothing to do with the ANALYZE command.

Ah, hence my confusion. Thanks for the clarification... I never knew about
ANALYZE as a separate command.



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Performace Optimization for Dummies

2006-09-28 Thread Tom Lane
Carlo Stonebanks [EMAIL PROTECTED] writes:
 But is there any way to create an index expression that will help with:
 where foo like '%bar%'?

If you are concerned about that, what you are probably really looking
for is full-text indexing.  See contrib/tsearch2 for our current best
answer to that.
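
A rough sketch of that approach with contrib/tsearch2 installed (column
and index names here are illustrative, and the function signatures vary
a little between releases):

-- keep a precomputed text-search vector next to the searchable column
ALTER TABLE entity ADD COLUMN foo_fts tsvector;
UPDATE entity SET foo_fts = to_tsvector(foo);
CREATE INDEX entity_foo_fts_idx ON entity USING gist (foo_fts);

-- word-level matching (not arbitrary substrings) against the index
SELECT * FROM entity WHERE foo_fts @@ to_tsquery('bar');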

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly