Re: [PERFORM] Benchmark Data requested

2008-02-08 Thread Mark Wong
On Mon, 4 Feb 2008 15:09:58 -0500 (EST) Greg Smith [EMAIL PROTECTED] wrote: On Mon, 4 Feb 2008, Simon Riggs wrote: Would anybody like to repeat these tests with the latest production versions of these databases (i.e. with PGSQL 8.3) Do you have any suggestions on how people should run

Re: [PERFORM] Benchmark Data requested

2008-02-08 Thread Mark Wong
On Mon, 04 Feb 2008 17:33:34 -0500 Jignesh K. Shah [EMAIL PROTECTED] wrote: Hi Simon, I have some insight into TPC-H on how it works. First of all I think it is a violation of TPC rules to publish numbers without auditing them first. So even if I do the test to show the better

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Dimitri Fontaine
On Thursday, February 7, 2008, Greg Smith wrote: On Wednesday, February 6, 2008, Dimitri Fontaine wrote: In other cases, a logical line is a physical line, so we start after the first newline found from the given lseek start position, and continue reading after the last lseek position until a newline.

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Kenneth Marshall
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote: On Thu, 7 Feb 2008, Dimitri Fontaine wrote: I was thinking of not even reading the file content from the controller thread, just decide splitting points in bytes (0..ST_SIZE/4 - ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Greg Smith
On Thu, 7 Feb 2008, Dimitri Fontaine wrote: I was thinking of not even reading the file content from the controller thread, just decide splitting points in bytes (0..ST_SIZE/4 - ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by beginning to process input after having read

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Mark Lewis
I was thinking of not even reading the file content from the controller thread, just decide splitting points in bytes (0..ST_SIZE/4 - ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by beginning to process input after having read first newline, etc. The problem I was

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Matthew
On Thu, 7 Feb 2008, Greg Smith wrote: The problem I was pointing out is that if chunk#2 moved forward a few bytes before it started reading in search of a newline, how will chunk#1 know that it's supposed to read up to that further point? You have to stop #1 from reading further when it
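The byte-range splitting being debated in these messages can be sketched roughly as follows. This is only an illustration of the idea, not pgloader's actual code; the function name and the boundary-fixing strategy are invented. Each split point after the first is advanced to the start of the next full line, so chunk i reads exactly up to where chunk i+1 begins, which resolves the coordination question raised above.

```python
import os

def chunk_offsets(path, n):
    """Split a file into n byte ranges aligned on newline boundaries.

    Each boundary after the first is moved forward to the start of the
    next full line, so every range covers whole lines and chunk i reads
    exactly up to where chunk i+1 begins.  (If a nominal split point
    happens to land exactly on a line start, that line simply goes to
    the previous chunk, which is harmless.)
    """
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(i * size // n)  # nominal split point, may be mid-line
            f.readline()           # skip the remainder of the current line
            cuts.append(f.tell())  # start of the next full line
    cuts.append(size)
    return list(zip(cuts, cuts[1:]))
```

Each (start, end) range can then be handed to a separate reader thread or process that seeks to its start offset and processes its slice independently.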

Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Greg Smith wrote: pgloader is a great tool for a lot of things, particularly if there's any chance that some of your rows will get rejected. But the way things pass through the Python/psycopg layer made it uncompetitive (more than 50% slowdown) against the

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Simon Riggs
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote: Multi-Threading behavior and CE support -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Now, pgloader will be able to run N threads, each one loading some data to a partitioned child-table target. N will certainly be configured depending on

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
Hi, I've been thinking about this topic some more, and as I don't know when I'll be able to go and implement it I'd want to publish the ideas here. This way I'll be able to find them again :) On Tuesday, February 5, 2008, Dimitri Fontaine wrote: On Tuesday, February 5, 2008, Simon Riggs wrote:

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Simon Riggs wrote: For me, it would be good to see a --parallel=n parameter that would allow pg_loader to distribute rows in round-robin manner to n different concurrent COPY statements. i.e. a non-routing version. What happens when you want at most N parallel

Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Greg Smith
On Wed, 6 Feb 2008, Dimitri Fontaine wrote: Did you compare to COPY or \copy? COPY. If you're loading a TB, if you're smart it's going onto the server itself if at all possible and loading directly from there. Would probably get a closer comparison against psql \copy, but recognize

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith
On Wed, 6 Feb 2008, Simon Riggs wrote: For me, it would be good to see a --parallel=n parameter that would allow pg_loader to distribute rows in round-robin manner to n different concurrent COPY statements. i.e. a non-routing version. Let me expand on this. In many of these giant COPY
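The non-routing round-robin behaviour Simon describes amounts to dealing input rows out across n workers. A minimal sketch of that distribution step (names are invented; in a real loader each bucket would be drained by its own worker holding a separate connection and running a concurrent COPY statement):

```python
from itertools import cycle

def round_robin(lines, n):
    """Distribute input lines over n buckets in round-robin order.

    In a parallel loader, each bucket would feed one of n concurrent
    COPY statements rather than accumulating in memory like this.
    """
    buckets = [[] for _ in range(n)]
    for bucket, line in zip(cycle(buckets), lines):
        bucket.append(line)
    return buckets
```

The point of the non-routing version is that no bucket needs to inspect row contents: distribution cost is constant per row, unlike the partition-routing mode where each row must be matched against child-table constraints.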

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
Hi Greg, On 2/6/08 7:56 AM, Greg Smith [EMAIL PROTECTED] wrote: If I'm loading a TB file, odds are good I can split that into 4 or more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders at once, and get way more than 1 disk worth of throughput reading. You have to

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Jignesh K. Shah
Greg Smith wrote: On Wed, 6 Feb 2008, Simon Riggs wrote: For me, it would be good to see a --parallel=n parameter that would allow pg_loader to distribute rows in round-robin manner to n different concurrent COPY statements. i.e. a non-routing version. Let me expand on this. In many of

Re: [PERFORM] Benchmark Data requested

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Greg Smith wrote: COPY. If you're loading a TB, if you're smart it's going onto the server itself if at all possible and loading directly from there. Would probably get a closer comparison against psql \copy, but recognize you're always going to be compared

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Greg Smith wrote: If I'm loading a TB file, odds are good I can split that into 4 or more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders at once, and get way more than 1 disk worth of throughput reading. pgloader already supports

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
[mailto:[EMAIL PROTECTED] Sent: Wednesday, February 06, 2008 12:41 PM Eastern Standard Time To: pgsql-performance@postgresql.org Cc: Greg Smith Subject: Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas On Wednesday, February 6, 2008, Greg Smith wrote: If I'm

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008 at 18:37:41, Dimitri Fontaine wrote: On Wednesday, February 6, 2008, Greg Smith wrote: If I'm loading a TB file, odds are good I can split that into 4 or more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders at once, and get

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008 at 18:49:56, Luke Lonergan wrote: Improvements are welcome, but to compete in the industry, loading will need to speed up by a factor of 100. Oh, I meant to compete with the internal COPY command instead of the \copy one, not with the competition. AIUI competing

Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith
On Wed, 6 Feb 2008, Dimitri Fontaine wrote: In fact, the -F option works by having pgloader read the given number of lines but skip processing them, which is not at all what Greg is talking about here I think. Yeah, that's not useful. Greg, what would you think of a pgloader which will

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Mon, 2008-02-04 at 17:33 -0500, Jignesh K. Shah wrote: First of all I think it is a violation of TPC rules to publish numbers without auditing them first. So even if I do the test to show the better performance of PostgreSQL 8.3, I cannot post it here or any public forum without doing

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Mon, 2008-02-04 at 17:55 -0500, Jignesh K. Shah wrote: Doing it at low scales is not attractive. Commercial databases are publishing at scale factor of 1000 (about 1TB) to 10000 (10TB), with one in 30TB space. So ideally right now tuning should start at 1000 scale factor. I don't

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Dimitri Fontaine
Hi, On Monday, February 4, 2008, Jignesh K. Shah wrote: Single stream loader of PostgreSQL takes hours to load data. (Single stream load... wasting all the extra cores out there) I wanted to work on this at the pgloader level, so the CVS version of pgloader is now able to load data in parallel,

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote: Simon Riggs wrote: On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: On Monday, February 4, 2008, Jignesh K. Shah wrote: Multiple table loads (1 per table) spawned via script are a bit better but hit WAL problems.

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew
On Tue, 5 Feb 2008, Richard Huxton wrote: In the case of a bulk upload to an empty table (or partition?) could you not optimise the WAL away? Argh. If I hadn't had to retype my email, I would have suggested that before you. ;) Matthew -- Unfortunately, university regulations probably

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: Hi, On Monday, February 4, 2008, Jignesh K. Shah wrote: Single stream loader of PostgreSQL takes hours to load data. (Single stream load... wasting all the extra cores out there) I wanted to work on this at the pgloader level,

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Richard Huxton
Simon Riggs wrote: On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote: Simon Riggs wrote: On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: On Monday, February 4, 2008, Jignesh K. Shah wrote: Multiple table loads (1 per table) spawned via script are a bit better but hit WAL

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote: Only by locking the table, which serializes access, which then slows you down or at least restricts other options. Plus if you use pg_loader then you'll find only the first few rows optimized and all the rest not. Hmm - the

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew
On Tue, 5 Feb 2008, Simon Riggs wrote: In the case of a bulk upload to an empty table (or partition?) could you not optimise the WAL away? That is, shouldn't the WAL basically be a simple transformation of the on-disk blocks? You'd have to explicitly sync the file(s) for the table/indexes of

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Richard Huxton
Simon Riggs wrote: On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote: Only by locking the table, which serializes access, which then slows you down or at least restricts other options. Plus if you use pg_loader then you'll find only the first few rows optimized and all the rest not. Hmm

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Richard Huxton
Simon Riggs wrote: On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: On Monday, February 4, 2008, Jignesh K. Shah wrote: Multiple table loads (1 per table) spawned via script are a bit better but hit WAL problems. pgloader will hit the WAL problem too, but it still may have its

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew
Apologies for the blank email - mailer problems. I lost all my nicely typed stuff, too. On Tue, 5 Feb 2008, Dimitri Fontaine wrote: Multiple table loads (1 per table) spawned via script are a bit better but hit WAL problems. pgloader will hit the WAL problem too, but it still may have its

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Richard Huxton
Matthew wrote: On Tue, 5 Feb 2008, Richard Huxton wrote: Why would you need to lock the table? Because you're not really writing the WAL, which means you can't let anyone else get their data into any of the blocks you are writing into. You'd basically want to write the disk blocks then

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Jignesh K. Shah
One of the problems with Empty Table optimization is that if there are indexes created then it is considered as no longer empty. Commercial databases have options like IRRECOVERABLE clause along with DISK PARTITIONS and CPU partitions for their bulk loaders. So one option turns off logging,

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew
On Tue, 5 Feb 2008, Richard Huxton wrote: So what's wrong with reserving the space using the WAL, then everyone else will know. After all, when you write the data to the WAL, you must have an idea of where it is meant to end up. My suggestion is that you go through all the motions of writing

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Richard Huxton
Matthew wrote: On Tue, 5 Feb 2008, Simon Riggs wrote: In the case of a bulk upload to an empty table (or partition?) could you not optimise the WAL away? That is, shouldn't the WAL basically be a simple transformation of the on-disk blocks? You'd have to explicitly sync the file(s) for the

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Matthew
On Tue, 5 Feb 2008, Richard Huxton wrote: Why would you need to lock the table? Because you're not really writing the WAL, which means you can't let anyone else get their data into any of the blocks you are writing into. You'd basically want to write the disk blocks then attach them in some

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Dimitri Fontaine
On Tuesday, February 5, 2008, Simon Riggs wrote: It runs a stream of COPY statements, so only the first would be optimized with the empty table optimization. The number of rows per COPY statement is configurable, so provided you have an estimate of the volume to import (wc -l), you could tweak
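The sizing Dimitri describes — use a row-count estimate (wc -l style) to pick the rows-per-COPY setting so that at most a bounded number of COPY statements is issued — could be sketched like this (an illustration only; the function name is invented and real pgloader exposes this as a configuration knob rather than an API):

```python
def copy_batches(lines, total_rows, max_copies):
    """Yield successive batches of rows, sized so that at most
    max_copies COPY statements are needed to load total_rows rows.

    A loader would issue one COPY ... FROM STDIN per yielded batch;
    with max_copies tuned low, only the first batch misses the
    empty-table optimization discussed in the thread.
    """
    batch_size = -(-total_rows // max_copies)  # ceiling division
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any final partial batch
        yield batch
```

The trade-off the thread is circling: fewer, larger COPY statements amortize per-statement overhead, but smaller batches bound the amount of work lost when one batch is rejected.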

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Dimitri Fontaine
On Tuesday, February 5, 2008, Simon Riggs wrote: I'll look at COPY FROM internals to make this faster. I'm looking at this now to refresh my memory; I already had some plans on the shelf. Maybe stealing some ideas from pg_bulkload could somewhat help here?

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote: On Tuesday, February 5, 2008, Simon Riggs wrote: I'll look at COPY FROM internals to make this faster. I'm looking at this now to refresh my memory; I already had some plans on the shelf. Maybe stealing some ideas from pg_bulkload

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote: That sounds cool to me too.. How much work is it to make pg_bulkload work on 8.3? An integrated version is certainly more beneficial. Especially, I think it will also help other setups like TPC-E where this is a problem.

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Heikki Linnakangas
Dimitri Fontaine wrote: On Tuesday, February 5, 2008, Simon Riggs wrote: I'll look at COPY FROM internals to make this faster. I'm looking at this now to refresh my memory; I already had some plans on the shelf. Maybe stealing some ideas from pg_bulkload could somewhat help here?

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Jignesh K. Shah
Hi Heikki, Is there a way such an operation can be spawned as a worker process? Generally during such loading - which most people will do during offpeak hours I expect additional CPU resources available. By delegating such additional work to worker processes, we should be able to capitalize

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Jignesh K. Shah
Commercial DB bulk loaders work the same way: they give you a fast-loader option provided that, in case of error, the whole table is truncated. This I think also has real-life advantages where PostgreSQL is used for datamarts which are recreated every now and then from other systems and they

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Heikki Linnakangas
Jignesh K. Shah wrote: Is there a way such an operation can be spawned as a worker process? Generally during such loading - which most people will do during offpeak hours I expect additional CPU resources available. By delegating such additional work to worker processes, we should be able to

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Simon Riggs
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote: Is there a way such an operation can be spawned as a worker process? Generally during such loading - which most people will do during offpeak hours I expect additional CPU resources available. By delegating such additional work to

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread Greg Smith
On Tue, 5 Feb 2008, Simon Riggs wrote: On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote: Even if it is a single core, the mere fact that the loading process will eventually wait for a read from the input file which cannot be non-blocking, the OS can timeslice it well for the second

Re: [PERFORM] Benchmark Data requested

2008-02-05 Thread NikhilS
Hi, On Feb 6, 2008 9:05 AM, Greg Smith [EMAIL PROTECTED] wrote: On Tue, 5 Feb 2008, Simon Riggs wrote: On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote: Even if it is a single core, the mere fact that the loading process will eventually wait for a read from the input file

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Claus Guttesen
There are some results here that show PostgreSQL is slower in some cases than Monet and MySQL. Of course these results were published immediately prior to 8.2 being released, plus run out-of-the-box, so without even basic performance tuning. Would anybody like to repeat these tests with the

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Luke Lonergan
Hi Simon, Note that MonetDB/X100 does not have a SQL optimizer, they ran raw hand-coded plans. As a consequence, these comparisons should be taken as an executor-executor test and we/you should be sure that the PG planner has generated the best possible plan. That said, we've already done the

[PERFORM] Benchmark Data requested

2008-02-04 Thread Simon Riggs
Can I ask for some help with benchmarking? There are some results here that show PostgreSQL is slower in some cases than Monet and MySQL. Of course these results were published immediately prior to 8.2 being released, plus run out-of-the-box, so without even basic performance tuning. Would

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Simon Riggs
On Mon, 2008-02-04 at 10:47 -0800, Luke Lonergan wrote: Note that MonetDB/X100 does not have a SQL optimizer, they ran raw hand-coded plans. As a consequence, these comparisons should be taken as an executor-executor test and we/you should be sure that the PG planner has generated the best

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Greg Smith
On Mon, 4 Feb 2008, Simon Riggs wrote: Would anybody like to repeat these tests with the latest production versions of these databases (i.e. with PGSQL 8.3) Do you have any suggestions on how people should run TPC-H? It looked like a bit of work to sort through how to even start this

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Jignesh K. Shah
Doing it at low scales is not attractive. Commercial databases are publishing at scale factor of 1000 (about 1TB) to 10000 (10TB), with one in 30TB space. So ideally right now tuning should start at 1000 scale factor. Unfortunately I have tried that before with PostgreSQL the few of the

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Simon Riggs
On Mon, 2008-02-04 at 15:09 -0500, Greg Smith wrote: On Mon, 4 Feb 2008, Simon Riggs wrote: Would anybody like to repeat these tests with the latest production versions of these databases (i.e. with PGSQL 8.3) Do you have any suggestions on how people should run TPC-H? It looked like

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Jignesh K. Shah
Hi Simon, I have some insight into TPC-H on how it works. First of all I think it is a violation of TPC rules to publish numbers without auditing them first. So even if I do the test to show the better performance of PostgreSQL 8.3, I cannot post it here or any public forum without doing

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Greg Smith
On Mon, 4 Feb 2008, Jignesh K. Shah wrote: Doing it at low scales is not attractive. Commercial databases are publishing at scale factor of 1000(about 1TB) to 1(10TB) with one in 30TB space. So ideally right now tuning should start at 1000 scale factor. I think what Simon was trying to

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Gregory Stark
Jignesh K. Shah [EMAIL PROTECTED] writes: Then for the power run that is essentially running one query at a time should essentially be able to utilize the full system (specially multi-core systems), unfortunately PostgreSQL can use only one core. (Plus since this is read only and there is no

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Luke Lonergan
Hi Simon, On 2/4/08 11:07 AM, Simon Riggs [EMAIL PROTECTED] wrote: executor-executor test and we/you should be sure that the PG planner has generated the best possible plan. If it doesn't then I'd regard that as a performance issue in itself. Agreed, though that's two problems to

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Luke Lonergan
Hi Greg, On 2/4/08 12:09 PM, Greg Smith [EMAIL PROTECTED] wrote: Do you have any suggestions on how people should run TPC-H? It looked like a bit of work to sort through how to even start this exercise. To run TPC-H requires a license to publish, etc. However, I think you can use their

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Greg Smith
On Mon, 4 Feb 2008, Luke Lonergan wrote: However, I think you can use their published data and query generation kit to run the queries, which aren't the benchmark per se. That's what the Monet/X100 people did. Right; I was just hoping someone might suggest some relatively standardized way

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Jignesh K. Shah
Gregory Stark wrote: Incidentally we found some cases that Solaris was particularly bad at. Is there anybody in particular that would be interested in hearing about them? (Not meant to be a knock on Solaris, I'm sure there are other cases Linux or BSD handle poorly too) Send me the

Re: [PERFORM] Benchmark Data requested

2008-02-04 Thread Jignesh K. Shah
TPC-H has two runs: the Power Run, which is a single stream (Q1-22, RF1, RF2), and the Throughput Run, which has N streams (depending on scale) running simultaneously in a mixed sequence of the same queries plus the two update functions. During the throughput run you can expect to max out CPU... But commercial databases