On Mon, 4 Feb 2008 15:09:58 -0500 (EST)
Greg Smith [EMAIL PROTECTED] wrote:
On Mon, 4 Feb 2008, Simon Riggs wrote:
Would anybody like to repeat these tests with the latest production
versions of these databases (i.e. with PGSQL 8.3)
Do you have any suggestions on how people should run
On Mon, 04 Feb 2008 17:33:34 -0500
Jignesh K. Shah [EMAIL PROTECTED] wrote:
Hi Simon,
I have some insight into how TPC-H works.
First of all I think it is a violation of TPC rules to publish numbers
without auditing them first. So even if I do the test to show the better
On Thursday, February 7, 2008, Greg Smith wrote:
On Wednesday, February 6, 2008, Dimitri Fontaine wrote:
In other cases, a logical line is a physical line, so we start after the first
newline found from the given lseek start position, and continue reading past the
last lseek position until a newline.
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote:
On Thu, 7 Feb 2008, Dimitri Fontaine wrote:
I was thinking of not even reading the file content from the controller
thread, just deciding the splitting points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and letting the reading
On Thu, 7 Feb 2008, Dimitri Fontaine wrote:
I was thinking of not even reading the file content from the controller
thread, just deciding the splitting points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and letting the reading thread fine-tune by
beginning to process input after having read
I was thinking of not even reading the file content from the controller
thread, just deciding the splitting points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and letting the reading thread fine-tune by
beginning to process input after having read the first newline, etc.
The problem I was
On Thu, 7 Feb 2008, Greg Smith wrote:
The problem I was pointing out is that if chunk #2 moved forward a few bytes
before it started reading in search of a newline, how will chunk #1 know that
it's supposed to read up to that further point? You have to stop #1 from
reading further when it
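[A minimal Python sketch of one convention that resolves this with no coordination at all (helper names are mine, not pgloader's): each chunk owns exactly the lines whose first byte falls inside its nominal range, so chunk #1 stops precisely where chunk #2 begins.]

    import os

    def chunk_ranges(path, n):
        # Nominal byte ranges [start, end) cutting the file into n pieces.
        size = os.path.getsize(path)
        step = size // n
        return [(i * step, size if i == n - 1 else (i + 1) * step)
                for i in range(n)]

    def read_chunk(path, start, end):
        # Yield every line whose first byte lies in [start, end).
        with open(path, 'rb') as f:
            if start > 0:
                f.seek(start - 1)
                f.readline()  # consume through the newline ending the line spanning `start`
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                yield line  # a line starting before `end` is ours, even if it crosses it

[Because both sides apply the same rule to the same byte offsets, the line that straddles a boundary is read in full by the earlier chunk and skipped by the later one.]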
On Wednesday, February 6, 2008, Greg Smith wrote:
pgloader is a great tool for a lot of things, particularly if there's any
chance that some of your rows will get rejected. But the way things pass
through the Python/psycopg layer made it uncompetitive (more than 50%
slowdown) against the
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote:
Multi-Threading behavior and CE support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Now, pgloader will be able to run N threads, each one loading some
data to a partitioned child-table target. N will certainly be configured
depending on
Hi,
I've been thinking about this topic some more, and as I don't know when I'll
be able to go and implement it I'd want to publish the ideas here. This way
I'll be able to find them again :)
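[To make the idea concrete, a minimal sketch of the N-thread, partition-routing design, assuming psycopg and hypothetical child tables child_0..child_3 whose CHECK constraints match the routing rule; all names and the DSN are mine, not pgloader's actual code.]

    import io
    import queue
    import threading

    import psycopg2  # assumed driver; pgloader itself sits on top of psycopg

    NUM_PARTS = 4        # one loader thread per child table
    BATCH_ROWS = 100000  # rows per COPY statement

    def route(line):
        # Hypothetical routing rule: the partition key is the first
        # tab-separated column, matching the children's CHECK constraints.
        return int(line.split(b'\t', 1)[0]) % NUM_PARTS

    def loader(part, q):
        conn = psycopg2.connect(dbname='bench')  # hypothetical DSN
        cur = conn.cursor()
        buf, rows = io.BytesIO(), 0
        while True:
            line = q.get()
            if line is not None:
                buf.write(line)
                rows += 1
            if line is None or rows == BATCH_ROWS:
                buf.seek(0)
                cur.copy_expert("COPY child_%d FROM STDIN" % part, buf)
                buf, rows = io.BytesIO(), 0
            if line is None:
                break
        conn.commit()

    queues = [queue.Queue(maxsize=10000) for _ in range(NUM_PARTS)]
    threads = [threading.Thread(target=loader, args=(i, q))
               for i, q in enumerate(queues)]
    for t in threads:
        t.start()
    with open('data.tsv', 'rb') as src:  # hypothetical input file
        for line in src:
            queues[route(line)].put(line)
    for q in queues:
        q.put(None)  # end-of-input marker
    for t in threads:
        t.join()

[With CPython's GIL, the win is less about client-side CPU parallelism than about overlapping I/O and letting the N server-side COPY commands run concurrently.]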
On Tuesday, February 5, 2008, Dimitri Fontaine wrote:
On Tuesday, February 5, 2008, Simon Riggs wrote:
On Wednesday, February 6, 2008, Simon Riggs wrote:
For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.
What happens when you want at most N parallel
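[Simon's non-routing variant is, in terms of the sketch above, the same machinery with the partition-key lookup replaced by a counter that deals rows out to the N COPY streams round-robin:]

    import itertools

    _counter = itertools.count()

    def route(line):
        # Non-routing variant: ignore the content and deal lines out
        # to the N COPY streams in round-robin order.
        return next(_counter) % NUM_PARTS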
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:
Did you compare to COPY or \copy?
COPY. If you're loading a TB, if you're smart it's going onto the server
itself if at all possible and loading directly from there. Would probably
get a closer comparison against psql \copy, but recognize
On Wed, 6 Feb 2008, Simon Riggs wrote:
For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.
Let me expand on this. In many of these giant COPY
Hi Greg,
On 2/6/08 7:56 AM, Greg Smith [EMAIL PROTECTED] wrote:
If I'm loading a TB file, odds are good I can split that into 4 or more
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
at once, and get way more than 1 disk worth of throughput reading. You
have to
Greg Smith wrote:
On Wed, 6 Feb 2008, Simon Riggs wrote:
For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.
Let me expand on this. In many of
On Wednesday, February 6, 2008, Greg Smith wrote:
COPY. If you're loading a TB, if you're smart it's going onto the server
itself if at all possible and loading directly from there. Would probably
get a closer comparison against psql \copy, but recognize you're always
going to be compared
On Wednesday, February 6, 2008, Greg Smith wrote:
If I'm loading a TB file, odds are good I can split that into 4 or more
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
at once, and get way more than 1 disk worth of throughput reading.
pgloader already supports
Sent: Wednesday, February 06, 2008 12:41 PM Eastern Standard Time
To: pgsql-performance@postgresql.org
Cc: Greg Smith
Subject: Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas
On Wednesday, February 6, 2008, Greg Smith wrote:
If I'm
On Wednesday, 06 February 2008 18:37:41, Dimitri Fontaine wrote:
On Wednesday, February 6, 2008, Greg Smith wrote:
If I'm loading a TB file, odds are good I can split that into 4 or more
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
loaders at once, and get
On Wednesday, 06 February 2008 18:49:56, Luke Lonergan wrote:
Improvements are welcome, but to compete in the industry, loading will need
to speed up by a factor of 100.
Oh, I meant competing with the internal COPY command instead of the \copy one,
not with the competition. AIUI competing
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:
In fact, the -F option works by having pgloader read the given number of lines
but skip processing them, which is not at all what Greg is talking about here
I think.
Yeah, that's not useful.
Greg, what would you think of a pgloader which will
On Mon, 2008-02-04 at 17:33 -0500, Jignesh K. Shah wrote:
First of all I think it is a violation of TPC rules to publish numbers
without auditing them first. So even if I do the test to show the
better performance of PostgreSQL 8.3, I cannot post it here or in any
public forum without doing
On Mon, 2008-02-04 at 17:55 -0500, Jignesh K. Shah wrote:
Doing it at low scales is not attractive.
Commercial databases are publishing at scale factors of 1000 (about 1 TB)
to 10000 (10 TB), with one at 30 TB. So ideally right now tuning
should start at the 1000 scale factor.
I don't
Hi,
On Monday, February 4, 2008, Jignesh K. Shah wrote:
Single stream loader of PostgreSQL takes hours to load data. (Single
stream load... wasting all the extra cores out there)
I wanted to work on this at the pgloader level, so the CVS version of pgloader
is now able to load data in parallel,
On Tue, 2008-02-05 at 14:43 +, Richard Huxton wrote:
Simon Riggs wrote:
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
On Monday, February 4, 2008, Jignesh K. Shah wrote:
Multiple table loads (1 per table) spawned via script are a bit better
but hit WAL problems.
On Tue, 5 Feb 2008, Richard Huxton wrote:
In the case of a bulk upload to an empty table (or partition?) could you not
optimise the WAL away?
Argh. If I hadn't had to retype my email, I would have suggested that
before you.
;)
Matthew
--
Unfortunately, university regulations probably
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
Hi,
On Monday, February 4, 2008, Jignesh K. Shah wrote:
Single stream loader of PostgreSQL takes hours to load data. (Single
stream load... wasting all the extra cores out there)
I wanted to work on this at the pgloader level,
Simon Riggs wrote:
On Tue, 2008-02-05 at 14:43 +, Richard Huxton wrote:
Simon Riggs wrote:
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
On Monday, February 4, 2008, Jignesh K. Shah wrote:
Multiple table loads (1 per table) spawned via script are a bit better
but hit WAL
On Tue, 2008-02-05 at 15:05 +, Richard Huxton wrote:
Only by locking the table, which serializes access, which then slows you
down or at least restricts other options. Plus if you use pg_loader then
you'll find only the first few rows optimized and all the rest not.
Hmm - the
On Tue, 5 Feb 2008, Simon Riggs wrote:
In the case of a bulk upload to an empty table (or partition?) could you
not optimise the WAL away? That is, shouldn't the WAL basically be a
simple transformation of the on-disk blocks? You'd have to explicitly
sync the file(s) for the table/indexes of
Simon Riggs wrote:
On Tue, 2008-02-05 at 15:05 +, Richard Huxton wrote:
Only by locking the table, which serializes access, which then slows you
down or at least restricts other options. Plus if you use pg_loader then
you'll find only the first few rows optimized and all the rest not.
Hmm
Simon Riggs wrote:
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
On Monday, February 4, 2008, Jignesh K. Shah wrote:
Multiple table loads (1 per table) spawned via script are a bit better
but hit WAL problems.
pgloader will hit the WAL problem too, but it still may have its
Apologies for the blank email - mailer problems. I lost all my nicely
typed stuff, too.
On Tue, 5 Feb 2008, Dimitri Fontaine wrote:
Multiple table loads (1 per table) spawned via script are a bit better
but hit WAL problems.
pgloader will hit the WAL problem too, but it still may have its
Matthew wrote:
On Tue, 5 Feb 2008, Richard Huxton wrote:
Why would you need to lock the table?
Because you're not really writing the WAL, which means you can't let
anyone else get their data into any of the blocks you are writing
into. You'd basically want to write the disk blocks then
One of the problems with the empty-table optimization is that if indexes
have been created, the table is no longer considered empty.
Commercial databases have options like an IRRECOVERABLE clause along
with DISK PARTITIONS and CPU PARTITIONS for their bulk loaders.
So one option turns off logging,
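[For reference, the empty-table optimization under discussion: PostgreSQL skips WAL for COPY into a table created or truncated in the same transaction when WAL archiving is disabled, since a crash simply rolls back to the empty table. A minimal sketch with psycopg2; table and file names are hypothetical.]

    import psycopg2  # assumed driver

    conn = psycopg2.connect(dbname='bench')  # hypothetical DSN
    cur = conn.cursor()
    # TRUNCATE and COPY inside one transaction: with WAL archiving off,
    # the data pages are written and fsync'ed directly instead of being
    # WAL-logged row by row; a crash just rolls back to the empty table.
    cur.execute("TRUNCATE child_0")
    with open('part0.tsv', 'rb') as f:
        cur.copy_expert("COPY child_0 FROM STDIN", f)
    conn.commit()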
On Tue, 5 Feb 2008, Richard Huxton wrote:
So what's wrong with reserving the space using the WAL, then everyone
else will know. After all, when you write the data to the WAL, you must
have an idea of where it is meant to end up. My suggestion is that you go
through all the motions of writing
Matthew wrote:
On Tue, 5 Feb 2008, Simon Riggs wrote:
In the case of a bulk upload to an empty table (or partition?) could you
not optimise the WAL away? That is, shouldn't the WAL basically be a
simple transformation of the on-disk blocks? You'd have to explicitly
sync the file(s) for the
On Tue, 5 Feb 2008, Richard Huxton wrote:
Why would you need to lock the table?
Because you're not really writing the WAL, which means you can't let anyone
else get their data into any of the blocks you are writing into. You'd
basically want to write the disk blocks then attach them in some
On Tuesday, February 5, 2008, Simon Riggs wrote:
It runs a stream of COPY statements, so only the first would be optimized
with the empty-table optimization.
The number of rows per COPY statement is configurable, so provided you have an
estimate of the volume to import (wc -l), you could tweak
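[Presumably the tweak is to size the batch from the wc -l estimate so the entire file goes through the first, optimized COPY; a short sketch (the exact name of pgloader's rows-per-COPY setting is an assumption here):]

    # Count input lines (the "wc -l" estimate) and size the COPY batches so
    # the whole file fits in a single COPY statement, keeping the empty-table
    # optimization for every row rather than just the first batch.
    def count_lines(path):
        with open(path, 'rb') as f:
            return sum(1 for _ in f)

    copy_every = count_lines('data.tsv')  # feed this to the rows-per-COPY setting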
On Tuesday, February 5, 2008, Simon Riggs wrote:
I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.
Maybe stealing some ideas from pg_bulkload could somewhat help here?
On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote:
On Tuesday, February 5, 2008, Simon Riggs wrote:
I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.
Maybe stealing some ideas from pg_bulkload
On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote:
That sounds cool to me too..
How much work is it to make pg_bulkload work on 8.3? An integrated
version is certainly more beneficial.
Especially, I think it will also help other setups like TPC-E too,
where this is a problem.
Dimitri Fontaine wrote:
On Tuesday, February 5, 2008, Simon Riggs wrote:
I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.
Maybe stealing some ideas from pg_bulkload could somewhat help here?
Hi Heikki,
Is there a way such an operation can be spawned as a worker process?
Generally during such loading, which most people will do during
off-peak hours, I expect additional CPU resources to be available. By
delegating such additional work to worker processes, we should be able
to capitalize
Commercial DB bulk loaders work the same way: they give you an option
for a fast loader provided that, in case of error, the whole table is
truncated. This I think also has real-life advantages where PostgreSQL
is used for data marts which are recreated every now and then from other
systems and they
Jignesh K. Shah wrote:
Is there a way such an operation can be spawned as a worker process?
Generally during such loading, which most people will do during
off-peak hours, I expect additional CPU resources to be available. By
delegating such additional work to worker processes, we should be able
to
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
Is there a way such an operation can be spawned as a worker process?
Generally during such loading, which most people will do during
off-peak hours, I expect additional CPU resources to be available. By
delegating such additional work to
On Tue, 5 Feb 2008, Simon Riggs wrote:
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
Even if it is a single core, the mere fact that the loading process will
eventually wait on a read from the input file, which cannot be
non-blocking, means the OS can timeslice it well for the second
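[The overlap being described is a plain producer/consumer split: one thread blocks in read() while the other parses and feeds COPY. A minimal sketch, with handle_batch standing in for the parse-and-COPY work:]

    import queue
    import threading

    def reader(path, q, batch=10000):
        # Producer: blocks on disk reads while the consumer keeps the CPU busy.
        with open(path, 'rb') as f:
            lines = []
            for line in f:
                lines.append(line)
                if len(lines) == batch:
                    q.put(lines)
                    lines = []
            if lines:
                q.put(lines)
            q.put(None)  # end-of-input marker

    def load(path, handle_batch):
        q = queue.Queue(maxsize=4)  # small queue: just enough to hide I/O waits
        threading.Thread(target=reader, args=(path, q), daemon=True).start()
        while True:
            batch = q.get()
            if batch is None:
                break
            handle_batch(batch)  # parse + COPY while the reader refills the queue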
Hi,
On Feb 6, 2008 9:05 AM, Greg Smith [EMAIL PROTECTED] wrote:
On Tue, 5 Feb 2008, Simon Riggs wrote:
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
Even if it is a single core, the mere fact that the loading process will
eventually wait on a read from the input file
There are some results here that show PostgreSQL is slower in some cases
than Monet and MySQL. Of course these results were published immediately
prior to 8.2 being released, and were run out-of-the-box, without even
basic performance tuning.
Would anybody like to repeat these tests with the
Hi Simon,
Note that MonetDB/X100 does not have a SQL optimizer; they ran raw
hand-coded plans. As a consequence, these comparisons should be taken as an
executor-executor test and we/you should be sure that the PG planner has
generated the best possible plan.
That said, we've already done the
Can I ask for some help with benchmarking?
There are some results here that show PostgreSQL is slower in some cases
than Monet and MySQL. Of course these results were published immediately
prior to 8.2 being released, and were run out-of-the-box, without even
basic performance tuning.
Would
On Mon, 2008-02-04 at 10:47 -0800, Luke Lonergan wrote:
Note that MonetDB/X100 does not have a SQL optimizer; they ran raw
hand-coded plans. As a consequence, these comparisons should be taken as an
executor-executor test and we/you should be sure that the PG planner has
generated the best
On Mon, 4 Feb 2008, Simon Riggs wrote:
Would anybody like to repeat these tests with the latest production
versions of these databases (i.e. with PGSQL 8.3)
Do you have any suggestions on how people should run TPC-H? It looked
like a bit of work to sort through how to even start this
Doing it at low scales is not attractive.
Commercial databases are publishing at scale factors of 1000 (about 1 TB)
to 10000 (10 TB), with one at 30 TB. So ideally right now tuning
should start at the 1000 scale factor.
Unfortunately I have tried that before with PostgreSQL; the few of the
On Mon, 2008-02-04 at 15:09 -0500, Greg Smith wrote:
On Mon, 4 Feb 2008, Simon Riggs wrote:
Would anybody like to repeat these tests with the latest production
versions of these databases (i.e. with PGSQL 8.3)
Do you have any suggestions on how people should run TPC-H? It looked
like
Hi Simon,
I have some insight into how TPC-H works.
First of all I think it is a violation of TPC rules to publish numbers
without auditing them first. So even if I do the test to show the
better performance of PostgreSQL 8.3, I cannot post it here or in any
public forum without doing
On Mon, 4 Feb 2008, Jignesh K. Shah wrote:
Doing it at low scales is not attractive. Commercial databases are
publishing at scale factors of 1000 (about 1 TB) to 10000 (10 TB), with one
at 30 TB. So ideally right now tuning should start at the 1000 scale
factor.
I think what Simon was trying to
Jignesh K. Shah [EMAIL PROTECTED] writes:
Then the power run, which essentially runs one query at a time, should
be able to utilize the full system (especially multi-core systems);
unfortunately PostgreSQL can use only one core. (Plus since this is read-only
and there is no
Hi Simon,
On 2/4/08 11:07 AM, Simon Riggs [EMAIL PROTECTED] wrote:
executor-executor test and we/you should be sure that the PG planner has
generated the best possible plan.
If it doesn't then I'd regard that as a performance issue in itself.
Agreed, though that's two problems to
Hi Greg,
On 2/4/08 12:09 PM, Greg Smith [EMAIL PROTECTED] wrote:
Do you have any suggestions on how people should run TPC-H? It looked
like a bit of work to sort through how to even start this exercise.
Running TPC-H requires a license to publish, etc.
However, I think you can use their
On Mon, 4 Feb 2008, Luke Lonergan wrote:
However, I think you can use their published data and query generation kit
to run the queries, which aren't the benchmark per se. That's what the
Monet/X100 people did.
Right; I was just hoping someone might suggest some relatively
standardized way
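[A sketch of driving the published kit, assuming dbgen and qgen are built in the current directory and the 22 query templates live in queries/; the -s flag and the DSS_QUERY convention come from the kit, but treat the exact paths as assumptions.]

    import os
    import subprocess

    # DSS_QUERY points qgen at the directory holding the 22 query templates.
    env = dict(os.environ, DSS_QUERY='queries')
    subprocess.run(['./dbgen', '-s', '1'], check=True)  # ~1 GB of *.tbl flat files
    for q in range(1, 23):
        sql = subprocess.run(['./qgen', str(q)], env=env,
                             capture_output=True, check=True).stdout
        with open('q%02d.sql' % q, 'wb') as out:
            out.write(sql)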
Gregory Stark wrote:
Incidentally we found some cases that Solaris was particularly bad at. Is
there anybody in particular that would be interested in hearing about them?
(Not meant to be a knock on Solaris; I'm sure there are other cases Linux or
BSD handle poorly too)
Send me the
TPC-H has two runs:
the Power Run, which is a single stream (Q1-22, RF1, RF2),
and the Throughput Run, which has N streams (N depends on scale) running
simultaneously in a mixed sequence of the same queries and the two
update functions. During the throughput run you can expect to max out the CPU...
But commercial databases
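[To make the shape of a throughput run concrete, a simplified sketch: N concurrent streams, each running Q1-22 in its own order. The spec's fixed permutation tables and the RF1/RF2 update streams are omitted; psycopg2 and the DSN are assumptions.]

    import random
    import threading

    import psycopg2  # assumed driver

    def stream(seed):
        # Each stream runs Q1-Q22 in its own permuted order.
        order = random.Random(seed).sample(range(1, 23), 22)
        conn = psycopg2.connect(dbname='tpch')  # hypothetical DSN
        cur = conn.cursor()
        for q in order:
            with open('q%02d.sql' % q) as f:
                cur.execute(f.read())
            cur.fetchall()

    N = 5  # stream count depends on the scale factor per the spec
    threads = [threading.Thread(target=stream, args=(i,)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()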