Re: [PERFORM] tuning postgresql 9.3.5 and multiple cores

2014-08-26 Thread Soni M
Changing to a CPU with a higher clock rate would be more helpful if you run fewer
than 32 queries at a time.


On Tue, Aug 26, 2014 at 8:51 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Monday, August 25, 2014, Jeison Bedoya Delgado 
 jeis...@audifarma.com.co wrote:

 Hi, recently I changed the hardware of my database from 32 cores up to 64 cores
 and 128GB RAM, but the performance is the same.  Perhaps I have to change
 some parameter in postgresql.conf?



 PostgreSQL does not (yet) automatically parallelize queries.

 Unless you have more than 32 queries trying to run at the same time,
 increasing the number of cores from 32 to 64 is unlikely to be useful.

 Cheers,

 Jeff




-- 
Regards,

Soni Maula Harriz


Re: [PERFORM] tuning postgresql 9.3.5 and multiple cores

2014-08-26 Thread Mark Kirkwood

On 26/08/14 06:47, Jeison Bedoya Delgado wrote:

Hi, recently I changed the hardware of my database from 32 cores up to 64
cores and 128GB RAM, but the performance is the same.  Perhaps I have to
change some parameter in postgresql.conf?



In addition to the points that others have made, even if you do have
32 active sessions it is not clear that 64 cores will automagically get
you twice the performance of 32 (or in fact any improvement at all). We are
seeing exactly this effect with a (60 core) machine that gets pretty much the
same performance as an older generation 32 core one.


Interestingly, while this is *likely* a software issue - it is not
immediately obvious where it lies - we tested Postgres (9.3/9.4/9.5) and
MySQL (5.5/5.6/5.7), *all* of which exhibited the same lack of improvement
with more cores.


Profiling suggested NUMA effects - but trying to eliminate these seemed
to simply throw up new factors that inhibit performance. My *guess* (and
it is a guess) is that we are seeing 2 (perhaps more) performance
bottlenecks very close to each other: NUMA and spinlock contention, at least.


Regards

Mark






Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread Alex Goncharov
On the COPY's atomicity -- looking for a definitive answer from a core
developer, not a user's guess, please.

Suppose I COPY a huge amount of data, e.g. 100 records.

My 99 records are fine for the target, and the 100-th is not -- it
comes with a wrong record format or a target constraint violation.

The whole thing is aborted then, and the good 99 records are not
making it into the target table.

My question is: Where have these 99 records been living on the
database server while the 100-th one hasn't arrived yet, and the need to
throw the previous data accumulation away hasn't arisen yet?

There have to be some limits to the space and/or counts taken by the
new, uncommitted, data, while the COPY operation is still in progress.
What are they?

Say, I am COPYing 100 TB of data and the bad records are close to the
end of the feed -- how will this all error out?

Thanks,

-- Alex



On Mon, Aug 25, 2014 at 11:48 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Fri, Aug 22, 2014 at 1:49 PM, Emi Lu em...@encs.concordia.ca wrote:

 Hello,


 Trying to insert 1 million records into one table through Java JDBC
 into PostgreSQL 8.3. May I know whether (1) or (2) is better, please?

 (1) set autocommit(true)
 (2) set autocommit(false)
  commit every n records (e.g., 100, 500, 1000, etc)


 In general it is better to use COPY (however the JDBC driver for 8.3
 exposes that), as it is designed specifically for bulk loading.

 Then it doesn't matter whether autocommit is on or off, because the COPY
 is a single statement.
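
A minimal JDBC sketch of the COPY route described above, assuming a driver
recent enough to expose the CopyManager API (older drivers may need a cast to
PGConnection instead of unwrap); the table name, file name and connection
details are placeholders:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and input file.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             FileReader data = new FileReader("data.csv")) {
            // COPY runs as a single statement, so the autocommit setting
            // makes no practical difference here.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn("COPY my_table FROM STDIN WITH CSV", data);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}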

 Cheers,

 Jeff



Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread Kevin Grittner
Alex Goncharov alex.goncharov@gmail.com wrote:

 Suppose I COPY a huge amount of data, e.g. 100 records.

 My 99 records are fine for the target, and the 100-th is not --
 it comes with a wrong record format or a target constraint
 violation.

 The whole thing is aborted then, and the good 99 records are not
 making it into the target table.

Right.  This is one reason people often batch such copies or check
the data very closely before copying in.

 My question is: Where have these 99 records been living on
 the database server while the 100-th one hasn't arrived yet, and
 the need to throw the previous data accumulation away hasn't
 arisen yet?

They will have been written into the table.  They do not become
visible to any other transaction until and unless the inserting
transaction successfully commits.  These slides may help:

http://momjian.us/main/writings/pgsql/mvcc.pdf

 There have to be some limits to the space and/or counts taken by
 the new, uncommitted, data, while the COPY operation is still in
 progress.  What are they?

Primarily disk space for the table.  If you are not taking
advantage of the unlogged load optimization, you will have
written Write Ahead Log (WAL) records, too -- which (depending on
your configuration) you may be archiving.  In that case, you may
need to be concerned about the archive space required.  If you have
foreign keys defined for the table, you may get into trouble on the
RAM used to track pending checks for those constraints.  I would
recommend adding any FKs after you are done with the big bulk load.

PostgreSQL does *not* have a rollback log which will impose a limit.

 Say, I am COPYing 100 TB of data and the bad records are close
 to the end of the feed -- how will this all error out?

The rows will all be in the table, but not visible to any other
transaction.  Autovacuum will clean them out in the background, but
if you want to restart your load against an empty table it might be
a good idea to TRUNCATE that table; it will be a lot faster.
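
A hedged JDBC sketch of that pattern -- TRUNCATE and COPY in one transaction
(which, as a side effect, lets the COPY skip WAL when archiving is off /
wal_level is minimal), with the FK added only after the load succeeds.  The
table, column and constraint names are illustrative, not anyone's actual schema:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;

public class BulkReload {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement();
                 FileReader data = new FileReader("data.csv")) {
                // Truncating in the same transaction as the COPY also enables
                // the WAL-skipping optimization when archiving is not in use.
                st.execute("TRUNCATE my_table");
                conn.unwrap(PGConnection.class).getCopyAPI()
                    .copyIn("COPY my_table FROM STDIN WITH CSV", data);
                // Add the FK only after the bulk load is done, per the advice above.
                st.execute("ALTER TABLE my_table ADD CONSTRAINT my_table_ref_fk "
                         + "FOREIGN KEY (ref_id) REFERENCES parent_table (id)");
                conn.commit();
            } catch (Exception e) {
                conn.rollback();  // failed load: nothing becomes visible
                throw e;
            }
        }
    }
}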

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread Alex Goncharov
Thank you, Kevin -- this is helpful.

But it still leaves questions for me.

Kevin Grittner kgri...@ymail.com wrote:

 Alex Goncharov alex.goncharov@gmail.com wrote:

  The whole thing is aborted then, and the good 99 records are not
  making it into the target table.

 Right.  This is one reason people often batch such copies or check
 the data very closely before copying in.

How do I decide, before starting a COPY data load, whether such a load
protection (complexity) makes sense (is necessary)?

Clearly not needed for 1 MB of data in a realistic environment.

It clearly is needed for loading 1 TB in a realistic environment.

To put it differently: If I COPY 1 TB of data, what criteria should I
use for choosing the size of the chunks to split the data into?

For INSERT loading, with database client interfaces that offer an
array mode, the performance difference between loading 100 or 1000
rows at a time is usually negligible, if any.  Therefore 100- and
1000-row array sizes are both reasonable choices.

But what is a reasonable size for a COPY chunk?  It can't even be
measured in rows.

Note that if you have a 1 TB record-formatted file to load, you can't
just split it into 1 MB chunks and feed them to COPY -- the file has to
be split on record boundaries.

So, splitting the data for COPY is not a trivial operation, and if
such splitting can be avoided, a reasonable operator will avoid it.
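
For what it's worth, a sketch of what record-boundary splitting would look
like for simple newline-terminated records (it already breaks down for formats
such as CSV with embedded newlines, which is part of the point); the chunk
size and file names are arbitrary:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

public class SplitOnRecordBoundaries {
    public static void main(String[] args) throws Exception {
        final long CHUNK_BYTES = 100L * 1024 * 1024;  // arbitrary ~100 MB target
        try (BufferedReader in = new BufferedReader(new FileReader("big-data-file"))) {
            BufferedWriter out = null;
            long written = CHUNK_BYTES;  // forces the first chunk file to be opened
            int part = 0;
            String record;
            while ((record = in.readLine()) != null) {
                if (written >= CHUNK_BYTES) {  // rotate only between records
                    if (out != null) out.close();
                    out = new BufferedWriter(new FileWriter("chunk-" + part++));
                    written = 0;
                }
                out.write(record);
                out.newLine();
                written += record.length() + 1;  // rough byte count for ASCII data
            }
            if (out != null) out.close();
        }
    }
}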

But then again: when can it be avoided?

  My question is: Where have these 99 records been living on
  the database server while the 100-th one hasn't arrived yet, and
  the need to throw the previous data accumulation away hasn't
  arisen yet?

 They will have been written into the table.  They do not become
 visible to any other transaction until and unless the inserting
 transaction successfully commits.  These slides may help:

 http://momjian.us/main/writings/pgsql/mvcc.pdf

Yeah, I know about the MVCC model...  The question is about the huge
data storage to be reserved without a commitment while the load is not
completed, about the size constraints in effect here.

  There have to be some limits to the space and/or counts taken by
  the new, uncommitted, data, while the COPY operation is still in
  progress.  What are they?

 Primarily disk space for the table.

How can that be found?  Is "df /mount/point" the deciding factor? Or
some 2^32 or 2^64 number?

 If you are not taking advantage of the unlogged load optimization,
 you will have written Write Ahead Log (WAL) records, too -- which
 (depending on your configuration) you may be archiving.  In that
 case, you may need to be concerned about the archive space required.

... may need to be concerned ... if what?  Loading 1 MB? 1 GB? 1 TB?

If I am always concerned, and check something before a COPY, what
should I be checking?  What are the OK-to-proceed criteria?

 If you have foreign keys defined for the table, you may get into
 trouble on the RAM used to track pending checks for those
 constraints.  I would recommend adding any FKs after you are done
 with the big bulk load.

I am curious about the simplest case where only the data storage is to
be worried about. (As an aside: the CHECK and NOT NULL constraints are
not a storage factor, right?)

 PostgreSQL does *not* have a rollback log which will impose a
 limit.

Something will though, right?  What would that be? The available disk
space on a file system? (I would be surprised.)

  Say, I am COPYing 100 TB of data and the bad records are close
  to the end of the feed -- how will this all error out?

 The rows will all be in the table, but not visible to any other
 transaction.

I see.  How much data can I fit there while doing COPY?  Not 1 TB?

-- Alex




Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread David G Johnston
On Tue, Aug 26, 2014 at 9:21 PM, Alex Goncharov [via PostgreSQL] wrote:

 Thank you, Kevin -- this is helpful.

 But it still leaves questions for me.


 Kevin Grittner wrote:

  Alex Goncharov wrote:

   The whole thing is aborted then, and the good 99 records are not
   making it into the target table.
 
  Right.  This is one reason people often batch such copies or check
  the data very closely before copying in.

 How do I decide, before starting a COPY data load, whether such a load
 protection (complexity) makes sense (is necessary)?


You should probably consider something like:

http://pgloader.io/

(I know there are others, this one apparently has the best marketing
team...)

In the normal case, with normal COPY, you load a bad file into an empty table,
it fails, you truncate and get better data for the next attempt.

How long that will take is system (IOPS/CPU) and data dependent.

The probability of failure is source dependent - and prior experience plays
a large role here as well.

If you plan to load directly into a live table, the wasted space from a bad
load could kill you, so smaller partial loads are better - if you can afford
the implicit system inconsistency such a partial load would cause.

If you understand how the system works you should be able to evaluate the
different pieces and come to a conclusion as to how best to proceed in a
specific situation.  No one else on this list has the relevant information
to make that judgement call.  If this is just asking about rules-of-thumb,
I'd say figure out how many records 100MB consumes and COMMIT after that
many records.  10,000 records is also a nice round number to pick -
regardless of the amount of MB consumed.  Start there and tweak based upon
experience.
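
If COPY is not an option and plain INSERTs are used (option (2) from the
original question), a sketch of that rule of thumb with JDBC batching -- the
table definition and file format are hypothetical, and the batch size is just
the 10,000-row figure above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsert {
    public static void main(String[] args) throws Exception {
        final int BATCH = 10000;  // commit every n records (rule of thumb above)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
            conn.setAutoCommit(false);  // option (2): autocommit off, periodic commits
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO my_table (id, val) VALUES (?, ?)")) {
                String line;
                long n = 0;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",", 2);
                    ps.setLong(1, Long.parseLong(f[0]));
                    ps.setString(2, f[1]);
                    ps.addBatch();
                    if (++n % BATCH == 0) {
                        ps.executeBatch();
                        conn.commit();  // bounds lost work and dead rows on failure
                    }
                }
                ps.executeBatch();  // flush the final partial batch
                conn.commit();
            }
        }
    }
}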

 If you are not taking advantage of the unlogged load optimization,
  you will have written Write Ahead Log (WAL) records, too -- which
  (depending on your configuration) you may be archiving.  In that
  case, you may need to be concerned about the archive space required.

 ... may need to be concerned ... if what?  Loading 1 MB? 1 GB? 1 TB?

 If I am always concerned, and check something before a COPY, what
 should I be checking?  What are the OK-to-proceed criteria?


If you only have 500k free in your archive directory, that 1MB file will
pose a problem... though if you have 4TB of archive available, the 1TB would
fit easily.  Do you compress your WAL files before shipping them off to the
archive?  How compressible is your data?

I'm sure people have decent rules-of-thumb here, but in the end your
specific environment and data, especially at the TB scale, are going to be
important; and that is something you will only discover through testing.



  If you have foreign keys defined for the table, you may get into
  trouble on the RAM used to track pending checks for those
  constraints.  I would recommend adding any FKs after you are done
  with the big bulk load.

 I am curious about the simplest case where only the data storage is to
 be worried about. (As an aside: the CHECK and NOT NULL constraints are
 not a storage factor, right?)


Correct



  PostgreSQL does *not* have a rollback log which will impose a
  limit.

 Something will though, right?  What would that be? The available disk
 space on a file system? (I would be surprised.)


   Say, I am COPYing 100 TB of data and the bad records are close
   to the end of the feed -- how will this all error out?
 
  The rows will all be in the table, but not visible to any other
  transaction.

 I see.  How much data can I fit there while doing COPY?  Not 1 TB?

 -- Alex


You need the same amount of space that you would require if the file
imported to completion.

PostgreSQL is optimistic in this regard - it assumes you will commit and
so up until failure there is no difference between a good and a bad import.
The magic is described in Slide 24 of the MVCC link above (
http://momjian.us/main/writings/pgsql/mvcc.pdf) - if the transaction is
aborted, then as far as the system is concerned the written data has been
deleted and can be cleaned up just as if the following sequence of
commands occurred:

BEGIN;
COPY tbl FROM ...;
COMMIT; -- success
DELETE FROM tbl;

Hence the comment to TRUNCATE after a failed load if at all possible -
to avoid the unnecessary VACUUM on tbl...

QUESTION: would the vacuum reclaim the disk space in this situation (I
presume yes)?  Because if not, and another import was to be attempted,
ideally the allocated space could be reused.

I'm not sure what a reasonable formula would be, especially at the TB
scale, but roughly 2x the size of the imported (uncompressed) file would be
a good starting point (table + WAL).  You likely would want many multiples
of this unless you are dealing with a one-off event.  Indexes and dead
tuples in 

Re: [PERFORM] Turn off Hyperthreading! WAS: 60 core performance with 9.3

2014-08-26 Thread Mark Kirkwood

On 26/08/14 10:13, Josh Berkus wrote:

On 08/22/2014 07:02 AM, Andres Freund wrote:

On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:

On 08/20/2014 07:40 PM, Bruce Momjian wrote:

Not sure how you can make such a blanket statement when so many people
have tested and shown the benefits of hyper-threading.


Actually, I don't know that anyone has posted the benefits of HT.
Link?


There's definitely cases where it can help. But it's highly workload
*and* hardware dependent.


The only cases I've seen where HT can be beneficial is when you have
large numbers of idle connections.  Then the idle connections can be
parked on the HT virtual cores.  However, even in this case I haven't
seen a head-to-head performance comparison.



I've just had a pair of Crucial m550's arrive, so a bit of benchmarking 
is in order. The results (below) seem to suggest that HT enabled is 
certainly not inhibiting scaling performance for single socket i7's. I 
performed several runs (typical results shown below).


Intel i7-4770 3.4 Ghz, 16G
2x Crucial m550
Ubuntu 14.04
Postgres 9.4 beta2

logging_collector = on
max_connections = 600
shared_buffers = 1GB
wal_buffers = 32MB
checkpoint_segments = 128
effective_cache_size = 10GB

pgbench scale = 300
test duration (each) = 600s

db on 1x m550
xlog on 1x m550

clients |  tps (HT)|  tps (no HT)
+--+-
4   |  517 |  520
8   | 1013 |  999
16  | 1938 | 1913
32  | 3574 | 3560
64  | 5873 | 5412
128 | 8351 | 7450
256 | 9426 | 7840
512 | 9357 | 7288


Regards

Mark




Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread Alex Goncharov
 Thank you, Kevin -- this is helpful.

Thank you David, too.

 But it still leaves questions for me.

Still...

Alex Goncharov alex.goncharov@gmail.com wrote:

 How do I decide, before starting a COPY data load, whether such a load
 protection (complexity) makes sense (is necessary)?

This is *the* practical question.

David G Johnston david.g.johns...@gmail.com wrote:

 You should probably consider something like: http://pgloader.io/

This is not my question; I want to see if anybody can offer a
meaningful evaluation strategy for deciding whether or not to use
COPY for loading big data.

If nobody can, fine: it'll give me the reason to claim "Nobody knows."

 Normal case, with normal COPY,

This is the case I am asking about: the COPY operation's limitations for
big data -- up to what point a plain COPY can be used.

 you load a bad file into an empty table, it fails, you truncate and
 get better data for the next attempt.

This is not how many businesses operate.

 How long that will take is system (IOPS/CPU) and data dependent.

"How long" was not the question: my question was originally about the
behavior for a bad record at the end of a large data set submitted to
COPY; when it was stated that the data in process becomes an
invisible (until committed) part of the target table, it became
obvious to me that the fundamental question has to be asked: how much
can fit there, in the temporary operational space (whatever it's
called in PostgreSQL)?  The free space reported by df /mount, or 2^32?

 The probability of failure is source dependent - and prior
 experience plays a large role here as well.

Not the question.

 If you plan to load directly into a live table the wasted space from
 a bad load could kill you so smaller partial loads are better - if
 you can afford the implicit system inconsistency such a partial load
 would cause.

Not the question.

 If you understand how the system works

I don't, to the necessary extent, so I asked for an expert opinion :)

 you should be able to evaluate the different pieces and come to a
 conclusion as how best to proceed in a specific situation.  No one
 else on this list has the relevant information to make that
 judgement call.

We'll see; too early to tell yet :)

 If this is just asking about rules-of-thumb

Yes.

 I'd say figure out how many records 100MB consumes and COMMIT after that
 many records.

Pardon me: I am running COPY and am supposed to know how many records have
been processed so far?.. (Hmm... can't be.)

 10,000 records is also a nice round number to pick - regardless of
 the amount of MB consumed.  Start there and tweak based upon
 experience.

You are clearly suggesting splitting the large data file into many
small ones -- splitting very intelligently, on the record boundaries.

And since this is very hard and would involve quite separate, external
processing machinery, I am trying to understand up to what point it is
safe not to do this (and subject to what factors).

 If you are not taking advantage of the unlogged load optimization,

I don't see any way to control this for COPY only.  Are you talking
about the 'postgresql.conf' settings?

 If you only have 500k free in your archive directory that 1MB file
 will pose a problem...though if you have 4TB of archive available
 the 1TB would fit easily.

So the answer to the "How much data can fit in the COPY storage
areas?" question is solely a "df /mount/point" thing?

I.e. before initiating the COPY, I should:

   ls -l DATA-FILE
   df -m /server/db-cluster/pg_data-or-something

compare the two values and be assured that my COPY will reach the end
of my DATA-FILE (whether it stumbles at the end or not) if the former
value is meaningfully smaller than the latter?

I would take this for the answer. (Let's see if there are other
evaluation suggestions.)
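
A trivial sketch of that pre-flight check, assuming the ls/df comparison
really is the criterion; the paths are the placeholders from above, and the
2x margin is only the table-plus-WAL starting point suggested earlier in the
thread:

import java.io.File;

public class CopyPreflightCheck {
    public static void main(String[] args) {
        File dataFile = new File("DATA-FILE");                    // placeholder
        File cluster  = new File("/server/db-cluster/pg_data");   // placeholder
        long need = dataFile.length();         // raw (uncompressed) bytes to load
        long free = cluster.getUsableSpace();  // free bytes on the cluster's filesystem
        System.out.printf("data: %d MB, free: %d MB%n",
                need / (1024 * 1024), free / (1024 * 1024));
        if (need * 2 > free) {  // crude margin: table + WAL, ignoring indexes and archiving
            System.err.println("Not enough headroom for this COPY; split or free space first.");
            System.exit(1);
        }
    }
}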

 Do you compress your WAL files before shipping them off to the
 archive?  How compressible is your data?

Try to give me the upper limit evaluation strategy, when all the
compression and archive factors are working in my favor.

 I'm sure people have decent rules-of-thumb here

I would love to hear about them.

 but in the end your specific environment and data, especially at the
 TB scale, is going to be important; and is something that you will
 only discover through testing.

"Don't malloc 2 GB on a system with 100 MB RAM" is a meaningful rule
of thumb, not requiring any testing.  I am looking for similar simple
guiding principles for COPY.

   Say, I am COPYing 100 TB of data and the bad records are close
   to the end of the feed -- how will this all error out?
 
  The rows will all be in the table, but not visible to any other
  transaction.

 I see.  How much data can I fit there while doing COPY?  Not 1 TB?

 You need the same amount of space that you would require if the file
 imported to completion.

 PostgreSQL is optimistic in this regard - it assumes you will commit
 and so up until failure there is no difference between a good and
 bad import.

I can see it now, thanks.