Re: serial + db key, or guid?

2020-08-11 Thread Christopher Browne
On Tue, 11 Aug 2020 at 12:40, Christophe Pettus  wrote:

> > On Aug 11, 2020, at 09:37, Mark Phillips 
> wrote:
> >
> > I posed the question on the chance things had evolved since 2012,
> specifically as it relates to postgres.
>
> The essentials haven't changed.  Keys (such as UUIDs, especially UUID v4)
> that have most of their randomness in the most significant bits can cause
> significant cache hit problems on large indexes.  128 bit keys are usually
> overkill for most applications, unless you need actual *global* uniqueness
> across more than a single database or installation; 64 bit keys are usually
> sufficient.
>

Thus, if performance is highly significant (e.g. - there's lots of data in
the table, and it is heavily read/written) then it may be tempting to use a
sequential value instead because that can be smaller, faster to compare,
and won't trash caches (e.g. - with UUIDs, seemingly adjacent data will be
spread wider across indexes and will need more cache accesses to get to the
data).

If the table is small, or data is infrequently queried/updated, these costs
may be irrelevant.
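For concreteness, a minimal sketch of the two key styles being compared
(gen_random_uuid() assumes the pgcrypto extension, or PostgreSQL 13+ where it
is built in):

create table txn_serial (
   id bigserial primary key,   -- 64 bits, cheap to compare, index-friendly
   payload text
);

create table txn_uuid (
   id uuid primary key default gen_random_uuid(),   -- 128 bits, random
   payload text
);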


> UUIDs (and similar very large random keys) do have the advantage that they
> are somewhat self-secure: You can expose them to outsiders without having
> to worry about other keys being guessable.
>

That isn't overly obvious from the values themselves; the "not guessable"
part comes from the fact that chronologically adjacent records won't have
any apparent similarity.

With serially assigned transaction IDs, if you, as a user, buy something,
and discover that your transaction ID was 1460795, you might well guess
that other recent purchases were on transactions 1460794, 1460793, and
such, and maybe get at someone else's data by messing with a web URL or
such.   Whereas, here are 5 UUIDs I just generated (dbus-uuidgen isn't
generating RFC 4122 compliant values, but in the context of a little
illustration, who cares?):
$ for i in 1 2 3 4 5; do dbus-uuidgen; done
0ff745301515c646498cd1165f32cc6e
a9ca459ab6330f24d24af5095f32cc6e
b1cff235d77b1f4d8504920a5f32cc6e
58773af20b34b3c550f4eebf5f32cc6e
f9a13ce961b28751b102c5545f32cc6e

There are some identical low-order bits, but they are pretty well hidden by
the high-order stuff.

It's somewhat cache-destroying, but not especially well guessable.

There is something which has evolved since 2012; see <
https://www.postgresql.org/docs/9.5/uuid-ossp.html>, notably
F.43.2. Building uuid-ossp

It used to be that adding UUID generator support required the OSSP library,
which was sufficiently inconvenient that this would often not be built-in.
Since 9.4 (released in 2014), uuid-ossp can use common built-in libraries
on Linux, OSX, BSD systems, so it's more likely that it will be included
"out of the box" on package-managed deployments of PostgreSQL.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Efficiently advancing a sequence without risking it going backwards.

2020-07-09 Thread Christopher Browne
On Thu, 9 Jul 2020 at 12:59, Jeremy Schneider 
wrote:

>
> > On Jul 6, 2020, at 19:06, Paul McGarry  wrote:
> >
> > I don't think I can use setval(), because it risks making sequences go
> backwards, eg:
> >
> > 1) Check values
> > DB1sequence: 1234
> > DB2sequence: 1233 (1 behind)
> > 2) setval('DB2sequence',1234);
> >
> > but if between (1) and (2) there are 2 nextval(DB2sequence) calls on
> another process,  (2) would take the sequence back from 1235 to 1234 and I
> would end up trying to create a duplicate key ID from the sequence.
>
> An ability to “lock” the sequence momentarily would give you the tool you
> need, but I don’t think it’s there.
>
> Total hack, but if your application or users can retry when the rare error
> is encountered then one idea is to rename the sequence momentarily while
> you do the setval() then rename it back. Do an initial check without
> renaming, then re-check after renaming and before the setval() call.
>
> If you put retry logic into your application then make sure to include
> back-off logic so you don’t get an outage induced by thundering herd.
>

This is increasingly looking like a set of attempts to intentionally abuse
what sequences were designed for.

The use-case where you need a lock on the value so that there can't
possibly be a hole in the sequence points at the notion of having some
other kind of a function that takes out a lock on a table, and serially
gives out "MAX+1" as the next value.

That isn't a very difficult function to write; the problem with it is that
that sort of function will forcibly serialize all inserts through the
function+table lock that is giving out "MAX+1" values.  That's going to be
WAY slower than using a sequence object, and about 98% of the time, people
will prefer the sequence object, particularly because it's about 98% faster.

I'm not quite sure if anyone has put out there a standard-ish idiom for
this; that seems like a not TOO difficult "exercise for the user."
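A minimal sketch of that exercise, with names invented for illustration:

create table invoice (
   invoice_no bigint primary key,
   created_at timestamptz not null default now()
);

create or replace function next_invoice_no() returns bigint
language plpgsql as $$
declare
   v_next bigint;
begin
   -- blocks every other caller until this transaction commits or rolls back
   lock table invoice in exclusive mode;
   select coalesce(max(invoice_no), 0) + 1 into v_next from invoice;
   return v_next;
end
$$;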

There will definitely be more failure cases, and *wildly* more fighting, in
a concurrent environment, over tuple locks.

- An obvious failure is that if one connection asks for the new MAX+1, gets
it, and then the transaction fails, for some later, out-of-relevant-scope,
reason, you'll still potentially get some "holes" in the series of values.

- If there are 10 connections trying to get MAX+1 concurrently, only one
can get it at a time, and that connection can't relinquish the lock until
its transaction has completed, and the 9 must wait, regardless of how much
work the "winner" still has to do.

These are amongst the reasons why people conclude they *don't* want that
kind of functionality.

It makes me think that the problem needs to be taken back to that initial
point of "I think I need some somewhat coordinated sequences", and poke at
what the *real* requirement is there, and why someone thinks that the
values should be "somewhat coordinated."  Something seems off there.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Kerberos-Postgresql implementation for user authentication

2020-07-09 Thread Christopher Browne
On Thu, 9 Jul 2020 at 10:15, Brajendra Pratap Singh <
singh.bpratap...@gmail.com> wrote:

> Hi Niels,
>
> Thanks for your prompt response and I apologise for the incomplete
> information.
> Actually we are trying to implement the Kerberos authentication while
> anyone trying to connect with postgresql DB user/role.
>
> Plz help us here with document.
>

The offerings you'll see here are going to be rather more generic; this is
a public mailing list where you will get general sorts of answers.

The generic "best answer" is to point to the standard documentation.  You
should review, for each version of PostgreSQL you are interested in, the
appropriate version of
https://www.postgresql.org/docs/12/client-authentication.html, and most
likely, what you should be looking at is the sections on GSSAPI.

And you have received some links to other material that appears to have
useful explanations.  Of course, nobody cooked that into a document that
speaks procedurally to your purposes on your several versions of CentOS and
of PostgreSQL; they pointed to explanations and tutorial material.

If you require a "procedural document" prepared in custom fashion, tailored
to your specific local requirements, then I rather think you will need to
engage one of the numerous firms that are available to provide those sorts
of professional services.  Here is a link to where you can find some such:

https://www.postgresql.org/support/professional_support/


Re: Table partitioning for cloud service?

2020-05-21 Thread Christopher Browne
On Thu, 21 May 2020 at 11:53, Israel Brewster 
wrote:

> On May 21, 2020, at 7:36 AM, Adrian Klaver 
> wrote:
>
> On 5/21/20 8:29 AM, Israel Brewster wrote:
>
> I’m working on my first cloud service, which will be backed by a
> postgresql database. Currently I only have a single customer, but of course
> I want to design with the possibility of multiple customers in mind. In
> that vein, I’m wondering what is “typical” in terms of designing the DB
> structure to make sure that one customer doesn’t “accidentally" get data
> for another customer? At the moment I am leaning towards giving each
> customer their own set of tables, with a unique prefix for each. This would
> provide a “hard” separation for the data,
>
>  I think that is called a schema:)
>
>
> Ok. That’s probably an option. Although it looks from a cursory perusal
> that for that to work, I would have to have separate DB users with
> different permissions. Which would be fine, except that I don’t have
> permissions to create users.
>
> Or set up a separate database for each in the cluster.
>
>
> Same as above - no permissions.
>
> At the moment, I am running on Heroku, which gives me a postgresql
> database, but not full control over the cluster. I may need to move to
> something more robust, if having completely separate databases is the best
> option. I was hoping to avoid SysAdmin stuff as much as possible, and focus
> on the software side, but obviously some sys admin is required.
>

There's a whole lot of "that depends" to this.

If there is not much data shared across customers, then it's a pretty good
answer to create a database for each one.  This is especially good if they
are only occasionally connected.

If there is a LOT of shared data, then "life gets more complicated."

It's a decently well documented problem out there; I just searched for
"multitenant database design" which showed up a number of decent
(not-Postgres-specific) answers

https://www.google.com/search?q=multitenant+database+design

Some common patterns include:
- A database per tenant (on Postgres, that means that PGDATABASE and/or
connection URIs change for each tenant)
- A schema per tenant (on Postgres, that means each time a tenant is added,
you need "CREATE SCHEMA" to establish the tenancy and "CREATE TABLE" for
each table in that tenancy, and connections use "set
search_path=tenantname;" to select data from the right tenant; there's a
sketch of this just after the list)
- Table-based tenancy (e.g. - each table has a "tenant_id" and queries need
to specify the tenant)
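
As a rough sketch of the schema-per-tenant pattern (names invented for
illustration):

create schema tenant_acme;
create table tenant_acme.orders (
   order_id bigserial primary key,
   placed_at timestamptz not null default now()
);

-- at connection time, point the session at the right tenant
set search_path = tenant_acme;
select count(*) from orders;   -- resolves to tenant_acme.orders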

An interesting answer, if there needs to be shared data, is for the shared
data to go in its own database, and use a Foreign Data Wrapper to have each
tenants' database access it <
https://www.postgresql.org/docs/12/postgres-fdw.html>
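
A hedged sketch of that arrangement using postgres_fdw, with invented names,
run from inside one tenant's database:

create extension postgres_fdw;

create server shared_data
   foreign data wrapper postgres_fdw
   options (host 'localhost', dbname 'shared');

create user mapping for current_user
   server shared_data
   options (user 'app_user', password 'secret');

create schema shared;
import foreign schema public
   from server shared_data into shared;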

There are lots of tradeoffs involved in each case; each of the above
patterns has merits and demerits particularly as the number of tenants
scales, as well as when you discover there are both tiny and large tenants
with differing requirements.

You need to look at it from various perspectives:
- How do application schema changes get handled as the application evolves?
- What are the security concerns about data sharing across tenants?
- What issues come up when managing storage across tenants?  (Some
approaches are easier to cope with than others)

If you don't have a fair bit of technical expertise locally, then
sophisticated choices will cause you problems that you won't be able to
solve.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Loading 500m json files to database

2020-03-23 Thread Christopher Browne
On Mon, 23 Mar 2020 at 06:24, pinker  wrote:

> Hi, do you have maybe idea how to make loading process faster?
>
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
>
> What I came up with now is:
>
> time for i in datafiles/*; do
>   psql -c "\copy json_parts(json_data) FROM $i"&
> done
>
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.
>
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables
> instead
> of 1)
>
>
> Any ideas?
>


Well, you're paying for a lot of overhead in that, as you're establishing a
psql command, connecting to a database, spawning a backend process, starting
a transaction, committing a transaction, closing the backend process,
disconnecting from the database, and cleaning up after the launching of the
psql command.  And you're doing that 500 million times.

The one thing I left out of that was the loading of a single tuple into
json_parts.

What you could do to improve things quite a lot would be to group some
number of those files together, so that each time you pay for the overhead,
you at least get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several
files together into one, and then \copy that one file in.
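
For instance (a sketch only; it assumes each file holds a single-line JSON
document with no raw tabs or backslashes, otherwise pre-process the batch
into proper COPY text format first), psql's \copy ... from program can do
the concatenation and the load in one step:

\copy json_parts(json_data) from program 'cat datafiles/batch_0001/*'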

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value to going past 1000 tuples per
batch; the overhead, by that point, is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel,
although it would require quite a bit more information about the I/O
capabilities of your hardware to know if that would do any good.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Could postgres12 support millions of sequences? (like 10 million)

2020-03-22 Thread Christopher Browne
On Sun, 22 Mar 2020 at 17:54, pabloa98  wrote:

>
> So the question may actually be:
>>
>> How do we improve our locking code, so we don't have to spawn millions
>> of sequences?
>>
>> What is the locking method you are using?
>>
>
> I am not using locking with the million sequence solution. I do not want
> something that locks because the problems described below
>
> I prefer the solution generates a gap (skip a couple of numbers) and not
> using locks.
>

If you can cope with gaps, that's a good thing.  That means that some kind
of usage of sequences would be compatible with your application.


>
>> > The lock part is because we solved a similar problem with a counter by
>> > row locking the counter and increasing it in another part of the
>> > database. The result is that all the queries using that table are
>> queued
>> > by pair (group, element) that is not that bad because we are not
>> > inserting thousands of rows by second. Still is killing cluster
>> > performance (but performance is still OK from the business point of
>> > view). The problem using locks is that they are too sensitive to
>> > developer errors and bugs. Sometimes connected clients aborts and the
>> > connection is returned to the pool with the lock active until the
>> > connection is closed or someone unlocks the row. I would prefer to have
>> > something more resilient to developers/programming errors, if possible.
>> >
>>
>> Now I read this paragraph, I realize I was not clear enough.
> I am saying we do not want to use locks because of all the problems
> described.
>

Cool, that means you have been thinking through similar considerations to
what others have in mind, and it doesn't sound like there are dramatically
different understandings.

Let's circle back to the schema that you provided...

CREATE TABLE counter(
group INT NOT NULL,
element INT NOT NULL,
seq_number INT NOT NULL default 0,
CONSTRAINT PRIMARY KEY (group, element)
);

That's not a table using the sequential values; that's what you're imagining
you could have as a way of referencing the sequences, right?

I would represent it slightly differently...

create table counter (
   "group" integer not null,
   element integer not null,
   sequence_name name,
   primary key ("group", element)
);
("group" needs quoting, since it's a reserved word.)  Arguably, there's no
need for sequence_name at all, as it's never directly referenced by anything.

And then have a function that might fire upon creation of new entries in
this table.

create or replace function generate_sequence (i_group integer, i_element
integer) returns name
as $$
declare
  c_seqname name;
  c_query text;
begin
   c_seqname := 'obj_counter_' || i_group || '_'  || i_element;
   c_query := 'create sequence if not exists ' || c_seqname || ';';
   execute c_query;
   update counter set sequence_name = c_seqname where "group" = i_group and
element = i_element;
   return c_seqname;
end
$$ language plpgsql;

You'd need a trigger function to put onto the table that runs this
function; that is left as an exercise for the reader.
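
For illustration, a sketch of that trigger (assuming PostgreSQL 11+ for the
EXECUTE FUNCTION spelling):

create or replace function counter_new_row() returns trigger
language plpgsql as $$
begin
   perform generate_sequence(new."group", new.element);
   return new;
end
$$;

create trigger counter_new_row
   after insert on counter
   for each row execute function counter_new_row();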

Then, on any of the tables where you need to assign sequence values, you'd
need a BEFORE INSERT trigger to do the assignment (an AFTER trigger can't
change the row it fires for).  The function that finds the sequence value
is kind of analogous:
create or replace function get_next_counter (i_group integer, i_element
integer) returns integer -- or bigint?
as $$
declare
  c_seqname name;
  c_query text;
  c_seqval integer;
begin
   c_seqname := 'obj_counter_' || i_group || '_'  || i_element;
   c_query := 'select nextval(' || quote_literal(c_seqname) || ')';
   execute c_query into c_seqval;
   return c_seqval;
end
$$ language plpgsql;

Again, that just grabs a nextval(); you'd need to execute this inside a
trigger function called ON INSERT on any of the tables that need sequence
values assigned.
That encapsulates the usage of this horde of sequences.  You're probably
calling get_next_counter() millions of times, so perhaps that code gets
expanded directly into place in the trigger function.
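
A sketch of such a trigger, against a hypothetical data table that carries
group/element columns plus the seq_number to be filled in:

create or replace function assign_seq_number() returns trigger
language plpgsql as $$
begin
   new.seq_number := get_next_counter(new."group", new.element);
   return new;
end
$$;

create trigger assign_seq_number
   before insert on observations
   for each row execute function assign_seq_number();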

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Setting up an environment of EDB Advance server

2020-01-08 Thread Christopher Browne
On Wed, 8 Jan 2020 at 02:15, Daulat Ram  wrote:

> Hi team,
>
>
>
> We would need your help in setting up an environment of EDB Advance Server
> 11 on Dev CentOS 7 VM.
>
>
>
>1. We need to create three database with separate table spaces :
>
>
>
> Test1
>
> Test2
>
> Test3
>
>  What would be the good strategy for the setup with regards to the
> tablespaces?
>
> Can we create a separate cluster for each database:
>

"EDB Advanced Server 11" is a proprietary product of a company called
Enterprise DB, and is
substantially different from Postgres.

For support relating to their product, you should really contact the vendor.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 17:38, Israel Brewster  wrote:

>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
> assuming CPU bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>

It sounds like you're in a decent place on this, and that you have done a
pretty apropos amount of exploration of the matter.

I was pleased to hear that you have the idempotency of the updates well in
hand, and that the application can cope with the degree of out-of-sync that
things will temporarily be.

The estimate of 10h to update the data doesn't surprise me; that's long
enough that it sure seems tempting to do the work in pieces so that you
don't have your whole set of application data locked for 10h.

I'd be inclined to call this "enough attention" for a one-off event.

I'll poke at the trigger aspect a wee bit; if the trigger function does a
one-tuple-at-a-time handling of things, so that it fires 50M times, you
might get a substantial speedup by replacing that with an equivalent set
operation that processes a few thousand tuples at a time.  That said, if
you're happy with the process running 10h, it's not worth unpeeling the
extra testing needed to ensure identical end states.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster  wrote:

> On Jan 6, 2020, at 10:08 AM, Christopher Browne 
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster 
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straight
>> forward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, lost connection, etc), all the work that has
>> been done already will be lost due to the transactional nature of such
>> commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>>
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches, say, of a few thousand tuples per batch.
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> then loop repeatedly along the lines...
>
> create temp table iteration as select pk_column from just_keys where
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
> from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column
> from iteration);
> drop table iteration;
>
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread tuples
> across the 8 processes, so that they can grab their own tuples and not
> block one another.
>
> In that case, more like...
> create temp sequence seq_procs start with 1 maxval 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
> processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys corresponding
> to their assigned batch number.
>
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be IO bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant update just_keys in your sample rather than
> update iteration on that line just before drop table iteration. Thanks for
> the info
>

As for parallelism, if you have really powerful disk, lots of disks on disk
array, it may help.  Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need
to get stopped to do this update?  Is the newest version of the app still
generating data that needs the rewriting?  Sure hope not...


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 13:36, Israel Brewster  wrote:

> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
>

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where
processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column
from iteration)]
update iteration set processed='true' where pk_column in (select pk_column
from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread tuples
across the 8 processes, so that they can grab their own tuples and not
block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as
processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding
to their assigned batch number.
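
If you are on PostgreSQL 11 or later, a procedure can drive the loop and
commit each batch itself; here's a hedged sketch reusing the names above and
your new_value() function (for the parallel variant, each worker would also
filter the batch on its assigned batch_id):

create or replace procedure run_batches()
language plpgsql as $$
begin
   loop
      with batch as (
         select pk_column from just_keys
          where processed is null
          limit 1000
      ), upd as (
         update big_historical_table t
            set changed_field = new_value()
           from batch b
          where t.pk_column = b.pk_column
         returning t.pk_column
      )
      update just_keys jk
         set processed = true
        from batch b
       where jk.pk_column = b.pk_column;

      exit when not found;   -- no unprocessed keys remain
      commit;                -- each batch is its own transaction
   end loop;
end
$$;
-- call run_batches();   (CALL must run outside an explicit transaction block)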

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Work hours?

2019-08-28 Thread Christopher Browne
On Wed, 28 Aug 2019 at 01:02, Ron  wrote:

> We did something similar to that, except all the columns were in one
> single table.  It wasn't a data warehouse, though: the RDBMS we used could
> be coerced into using a date index when large ranges were needed in detail
> tables by joining it to T_CALENDAR, and doing the range filter on
> T_CALENDAR.
>

Ah, interesting!  I like it, mostly...

The one bad thing would be that this sorta mis-matches timestamp with time
zone, which is a more or less continuous data type (rather than
discrete, like date).  I could see an argument, in that environment, to put
a DATE type onto detail tables if they are inevitably being joined to
T_CALENDAR.

I recall we had a case where some reports were ridiculously inefficient
because a query involved effectively a "where date_part(something, column)"
clause that made that into a Seq Scan.

Alternatively (and I'm thinking out loud here), I wonder if putting a range
type with a pair of timestamps would help with matching, as the range type
would put the full range of each day into the table; you could have full
date/time stamps match the calendar table via the range type...

select [stuff] from tz_table t, t_calendar tc
where
   [various stuff]
   and tc.t_workday
   and tc.t_date between '2017-01-01' and '2017-02-01'
   -- and, to get the range bit:
   and t.original_tstz <@ tc.t_range;
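
A sketch of how the range column might get added to the calendar table
(guessing at column names):

alter table t_calendar add column t_range tstzrange;

update t_calendar
   set t_range = tstzrange(t_date::timestamptz,
                           (t_date + 1)::timestamptz,
                           '[)');

create index on t_calendar using gist (t_range);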
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Work hours?

2019-08-27 Thread Christopher Browne
On Tue, Aug 27, 2019, 6:27 PM stan  wrote:

> I am just starting to explore the power of PostgreSQL's time and date
> functionality. I must say they seem very powerful.
>
> I need to write a function that, given a month, and a year as input returns
> the "work hours" in that month. In other words something like
>
> 8 * the count of all days in the range Monday to Friday) within that
> calendar month.
>
> Any thoughts as to the best way to approach this?


In data warehouse applications, they have the habit of creating tables that
have various interesting attributes about dates.

https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac

I'd be inclined to solve this by defining various useful sets of dates; you
might then attach relevant attributes to a dimension table like the d_date
table in the article.

- a table with all weekdays (say, Monday to Friday)

- a table listing statutory holidays that likely need to be excluded

These are going to be small tables even if you put 10 years worth of dates
in it.

Select entries from the first table, excluding holidays from the second,
and you get the set of working days that can then be counted to get the
desired result.
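
A minimal sketch of those two tables (names invented), populated via
generate_series:

create table weekday (day date primary key);

insert into weekday
select d::date
  from generate_series('2019-01-01'::date, '2028-12-31'::date,
                       interval '1 day') d
 where extract(isodow from d) between 1 and 5;   -- Monday .. Friday

create table stat_holiday (day date primary key);

-- work hours for September 2019
select 8 * count(*)
  from weekday w
 where w.day >= '2019-09-01' and w.day < '2019-10-01'
   and not exists (select 1 from stat_holiday h where h.day = w.day);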

The "data warehouse" approach would probably be to put an extra
is_statutory_holiday column onto the dimension table; then you could do...

Select count(*) * 8 from d_date where date_actual between '2019-09-01' and
'2019-09-30' and day_of_week in (1,2,3,4,5) and not is_statutory_holiday;

If there are multiple jurisdictions with differing sets of holidays, that's
going to complicate life.  Down that road, I'd define a bunch of tables for
differing jurisdictions' holidays, clearly that diverges a bit from the
data warehousing approach.


Re: Postgres 10.7 Systemd Startup Issue

2019-06-06 Thread Christopher Browne
On Thu, Jun 6, 2019, 8:19 PM Tom Lane  wrote:

> Francisco Olarte  writes:
> > On Thu, Jun 6, 2019 at 6:57 PM Kelly, Kevin  wrote:
> >> We’re attempting to launch postgres via systemd and noticing that when
> invoking via systemctl start postgres.service the prompt never returns. If
> we switch to another tty and check the status it shows as:
> >> Active: activating (start) since Thu 2019-06-06 09:36:32 EDT; 12min ago
> >> If we change the type from notify to forking we see the same behavior.
> The instance seems to be up and running just fine, we just never see the
> active (running) status as we have come to expect.
>
> > Are you sure you have the postgres.service correctly configured? ( or
> > you could post the ExecStart/Type config, or the whole service file ).
>
> > The type tells systemd how to know the service has finished starting
> > and is running, notify means it does systemd integration via socket,
> > with IIRC postgres does not.
>
> I might be confusing this with some other issue, but I think PG does
> support systemd notification if you build it with the --with-systemd
> configuration option.  This is recommended if you're using systemd
> to launch PG, because systemd's other options for figuring out the
> service state work substantially less well.


Yeah, I was one of the reviewers of the patch for this feature.

If the configuration parameter is turned on then Postgres reports in to
SystemD once it completes any crash recovery work to indicate that the
database service is up.

That way, if there was a bunch of WAL needing to be processed, services
depending on Postgres will properly get deferred.

You need to set the service type in the SystemD service file to
   Type=notify
in order to get this behavior.


Re: Converting yes or no to one letter strings.

2019-06-05 Thread Christopher Browne
On Tue, 4 Jun 2019 at 18:30, Lou  wrote:

> Hi everyone,
>
> Is it possible to convert a boolean yes or no field to hold a one letter
> string? For example, the strings: 's' 'f' 'p' 'e'
>
> To start off, I just need to convert true to 's'. false will have to be
> manually changed to 'f' or 'p' or 'e' as appropriate.
>
> Lou
>
Certainly you can.

origin=# create table foo (id serial primary key, name text, tfvalue
boolean);
CREATE TABLE
origin=# alter table foo alter column tfvalue set data type character;
ALTER TABLE

Perfectly fine if the table has no values in that column.

But wait, maybe not...
origin=# create table foo (id serial primary key, name text, tfvalue
boolean);
CREATE TABLE
origin=# insert into foo (name, tfvalue) values ('chris', 'true');
INSERT 0 1
origin=# insert into foo (name, tfvalue) values ('dave', 'false');
INSERT 0 1
origin=# insert into foo (name) values ('brad');
INSERT 0 1
origin=# alter table foo alter column tfvalue set data type character;
ERROR:  value too long for type character(1)

Nope, you can't do that if there's data in the table.

This seems like an altogether terrible idea, actually.  If you need a new
column with a totally different interpretation, you should probably create
a totally new column, that way you can do whatever you wish to the new
column.

Might want to consult the trusty documentation, too.  <
https://www.postgresql.org/docs/9.5/datatype-boolean.html>
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Feature request (or at least discussion): enable autovaccum on temp tables

2019-05-31 Thread Christopher Browne
On Fri, May 31, 2019, 12:26 PM Ivan Voras  wrote:

> Hi,
> On Fri, 31 May 2019 at 17:43, Andrew Gierth 
> wrote:
>
>> > "Ivan" == Ivan Voras  writes:
>>
>>  Ivan> Since AFAIK temp tables are very close to unlogged ordinary
>>  Ivan> tables, what would stop this feature from being implemented?
>>
>> The key difference between temp tables and other tables is that temp
>> table data does not use the shared_buffers but is buffered only in
>> backend-local memory.
>>
>> This means that other processes (like, say, an autovacuum process) can
>> not access the content of temp tables. So what you want is not possible.
>>
>
> Understood.
>
> So, a related question, since we have dozens of temp tables and a lot of
> code, is there a way to look up what temp tables are being created by the
> current session, so I can do a VACUUM or ANALYZE on all of them in bulk? I
> know I can inspect pg_temp_* schema, but how to figure out which one is
> from the current session?
>


I think you might want to look at information_schema.tables; I seem to
recall that view restricting itself to the tables actually visible to your
connection.
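
Something along these lines should do it (a sketch; I believe table_type =
'LOCAL TEMPORARY' there only covers your own session's temp tables):

select format('analyze %I.%I;', table_schema, table_name)
  from information_schema.tables
 where table_type = 'LOCAL TEMPORARY';
-- in psql, append \gexec to run each generated ANALYZE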



Re: Feature request (or at least discussion): enable autovaccum on temp tables

2019-05-31 Thread Christopher Browne
On Fri, 31 May 2019 at 11:26, Ivan Voras  wrote:

> Hello,
>
> The reason why we are using temp tables is to allow concurrent runs on
> some very large reports which involve creating and heavily churning dozens
> of very large tables.
>
> The problem we're facing is that if we DON'T use temp tables, the reports
> finish in a couple of hours. If we DO use temp tables, they finish in a
> couple of weeks. While using regular tables, autovacuum runs pretty much
> continuously.
>
> We'll try manually vacuuming the tables after major operations, but that's
> kind of kludgy. What would help is a GUC setting which toggles this
> behaviour.
>
> Since AFAIK temp tables are very close to unlogged ordinary tables, what
> would stop this feature from being implemented?
>
>
I do hit some situations similar to this; it does NOT lead me to wanting to
VACUUM a lot; rather, after any query where I populate a lot of data in a
temporary table, I need to add an explicit ANALYZE, which tends to be more
than sufficient.

It is problematic for autovacuum to do this for two reasons:

1. Frequently, the moment that the ANALYZE *needs* to be run is the instant
after the table is populated, because if it waits any longer, that will be
too late to help the Immediately Following Query where the freshly
populated temp table's data was used.

In effect, the only "safe" answer for those processes is for the SQL code
to include an ANALYZE in the crucial spot.   If autovacuum tried to do this
work, it might get to the table 5 seconds late, and thereby not do the good
that you need.

2.  Temp tables are only associated (and visible) in the session in which
you are doing the work.  autovacuum operates inside an ordinary session
context, and in a separate connected session, so it can't see your temp
tables; they are not handled in shared memory that crosses contexts to the
one in which those temp tables exist.  Changing that would be way more
difficult than meets the eye.

I would actually suggest that it is likely that VACUUM never needs to be
done, as there's something wrong if a lot of garbage is being generated in
temp tables.  All that should be needed are ANALYZE requests, and it makes
sense to annotate the code with an ANALYZE after any time a table is
massively updated.
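
In code, that annotation is as simple as (hypothetical table names):

create temp table report_scratch as
   select customer_id, sum(amount) as total
     from big_fact_table
    group by customer_id;

analyze report_scratch;   -- gather stats now, before the next query plans against it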
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: User Details for PostgreSQL

2019-05-09 Thread Christopher Browne
On Thu, 9 May 2019 at 16:43, Kumar, Virendra 
wrote:

> Hello Team,
>
>
>
> We are looking for some audit information about user creation. We need a
> few audit fields which we did not find in PostgreSQL. I would be happy if
> someone help us in finding these details. Besically we need information
> about:
>
> 1.   User creation date
>
> 2.   Last Password change date
>
>
>
> Do we have a way to get these values or can somebody guide us how we can
> store and get these values while creating user.
>
>
>
> Regards,
>
> Virendra
>

Since there is a diversity of ways of managing this information, including
outside the database, there is no way to assert a true-in-general mechanism
for this.

Indeed, if you are interested in managing such information particularly
carefully, you may wish to use mechanisms such as PAM, Kerberos, LDAP,
GSSAPI for this, in which case PostgreSQL may have no responsibility in the
matter of managing passwords.  It is quite likely a good idea to use
something like Kerberos if you have the concerns that you describe, and if
so, the audit information you want would be collected from Kerberos, not
PostgreSQL.

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: plctl extension issue postgresql 11.2

2019-03-28 Thread Christopher Browne
On Thu, 28 Mar 2019 at 17:11, Prakash Ramakrishnan <
prakash.ramakrishnan...@nielsen.com> wrote:

> Hi Tom,
>
> Thanks for your big help i did yum install plctl and extension below path
> but still am getting error please find the below details,
>
> ==> psql
> psql (11.2)
> Type "help" for help.
>
> postgres=# \c "AIBE01PR"
> You are now connected to database "AIBE01PR" as user "postgres".
> *AIBE01PR=# create extension plctl;*
> *ERROR:  could not open extension control file
> "/usr/pgsql-11/share/extension/plctl.control": No such file or directory*
> *AIBE01PR=#*
>

If that is cut/and/pasted, then it is crystal clear that it won't work
because there's a typo.

There is no such computer language as "ctl".  And therefore...
There is no such PostgreSQL extension as "plctl"

Try "create extension pltcl;" instead.

It is an absolute certainty that "create extension plctl;" WILL NOT WORK,
regardless of what software you may have installed.  In contrast "create
extension pltcl;" may quite possibly work.

Many of us don't use your Linux distribution, so can only take wild guesses
as to what commands might install what RPM files that might be helpful.

You almost certainly did not "yum install plctl" with much success, though
conceivably "yum install pltcl" might have worked instead.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Where to store Blobs?

2019-03-13 Thread Christopher Browne
On Wed, 13 Mar 2019 at 10:27, Thomas Güttler
 wrote:
> I guess most people do not store Blobs in PostgresSQL.
>
> Where do you store Blobs?

Things have changed, but at one time, we were using RT as our
ticketing system (https://bestpractical.com/request-tracker) and it
would capture documents as database objects.

The table where RT stowed downloadable documents was one of the
largest tables in the database because of there being a few 50MB
copies of Access Databases and some many-MB spreadsheets in there.

It worked fine; no problems evident from it.  It was certainly
surprising to find such large documents there, and if people had
gotten in the habit of putting GBs of data into RT, that would have
probably led to some policy changes to prevent it, but unless you're
pretty actively trying to blow the system up, it just works.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: POSTGRES/MYSQL

2019-03-12 Thread Christopher Browne
On Tue, 12 Mar 2019 at 12:53, Benedict Holland
 wrote:
> I am not saying it is not well documented. I am saying that it isn't ACID 
> compliant, which it isn't, as they document.

I *love* the notion of being able to roll back DDL, but it has long
been common for DDL to *not* be transactional even with some of the
Big Expensive Databases (such as the one whose name begins with an
"O").

Up until version 11.something, "Big O" apparently did NOT have this,
and MS SQL Server didn't in version 2008.

https://docs.oracle.com/cd/B19306_01/server.102/b14220/transact.htm
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189122(v=sql.105)

Of course, those are somewhat old versions.  But nobody would have
claimed those systems not to be "ACID Compliant" at the time; you're
setting the bar a bit too high.

Someone's asking the merits of PostgreSQL versus MySQL; it certainly
*is* possible to overplay the case.

I'm perfectly happy with a claim like...

 "PostgreSQL does transactional DDL, which we find quite valuable, and
while MySQL supports ACID for data manipulation, with suitable choice
of storage engines, there is not the same capability to be able to
roll back DDL within a transaction."
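
For anyone unfamiliar with what that buys you, a tiny illustration:

begin;
create table widgets (id bigserial primary key, name text);
alter table widgets add column price numeric(10,2);
rollback;   -- in PostgreSQL, both statements vanish; no half-built table remains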
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: POSTGRES/MYSQL

2019-03-11 Thread Christopher Browne
On Mon, 11 Mar 2019 at 12:36, Michael Nolan  wrote:
> There isn't a simple answer to this, it's like asking 'which is better for 
> cooking:  aluminum or stainless steel'.  The answer is 'it depends on what 
> you're trying to do'.

Metaphors can be dangerous (especially when automotive ones get
chosen!), but this is a pretty good one.

Often, it doesn't terribly much matter which kind of cookware you use;
both aluminum and steel will allow you to cook your dinner, and if
your needs are not challenging, the differences may make little
difference.  That seems likely to be true here; both Postgres and
MySQL have a sizable set of relatively common facilities where they
would function in broadly similar ways.  This wasn't so true in the
MySQL 3 versus PostgreSQL 6 days, when they had enormously different
expressions of basic usage patterns.  (e.g. - in MySQL, you'd get data
eaten by weak data types, or performance killed when writes demand
full table locks, but PostgreSQL would lose performance when VACUUM
would lock the whole table).  Modern MySQL is a lot less sloppy than
it used to be, and Modern PostgreSQL performs a way lot better than it
used to.

And you can certainly fry bacon in either a steel or aluminum pan;
performance of that is liable to depend more on the qualities of stove
and of the bacon than on the qualities of the fry pan.

Paralleling the notion that performance and reliability might depend
more on the qualities of the server, between CPUs, amounts and speed
of RAM, and whether you're persisting data on SSDs versus "spinning
rust disks."  If your disk dies, the database goes away, "poof,"
irrespective of what brand of database you're using...

It is also somewhat foolish to get deep into minutiae when we have no
idea which details do or do not matter for the particular situation.

It's pretty likely that there *is* some relevant minutiae, but, when
the only details given are:

- Retail company
- DB size ~30GB
- Present database is DB2

That doesn't point all that readily at "relevant minutiae".

Sorts of things that might help:
- What kinds of data types are in use?
- What is the architecture of the system in practice?
- What kinds of transaction load are being thrown at the system?
  - Presumably some OLTP activity to record retail activities
  - Reporting on activities (which might involve replication?)
  - Data lifecycles (how does data arrive, how does it exit once irrelevant?)

There are consulting organizations out there that would be quite
prepared to help with that sort of analysis.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Multiple LDAP Servers for ldap Authentication

2018-12-20 Thread Christopher Browne
On Thu, 20 Dec 2018 at 16:17, Kumar, Virendra
 wrote:
>
> I figured it out, this is how it works:
> --
> host    all    all    0.0.0.0/0    ldap ldapserver=server1.com ldapserver=server2.com ldapprefix=PROD01\
>
> So documentation need some update.

Looking at the docs (doc/src/sgml/client-auth.sgml), it appears to be
correct, though it is somewhat ambiguous in that I'd expect
"ldapserver=server1.com server2.com" to be OK rather than needing
"ldapserver=server1.com ldapserver=server2.com" instead.  I'm not certain of
a way to explain this better.

 
 <varlistentry>
  <term><literal>ldapserver</literal></term>
  <listitem>
   <para>
    Names or IP addresses of LDAP servers to connect to. Multiple
    servers may be specified, separated by spaces.
   </para>
  </listitem>
 </varlistentry>

Perhaps the attached?


-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index c2114021c3..48643d2cb6 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1490,7 +1490,7 @@ omicron bryanh  guest1
       <listitem>
        <para>
         Names or IP addresses of LDAP servers to connect to. Multiple
-        servers may be specified, separated by spaces.
+        servers may be specified, each ldapserver=servername separated by whitespace.
        </para>
       </listitem>
      </varlistentry>


Re: psql is hanging

2018-12-02 Thread Christopher Browne
I like the "add an analyze" idea; the two most likely causes of the
phenomenon (to my mind) are either:

a) Something's getting locked and Tom Lane's idea of checking pg_locks
when you notice it's stopped can help track down the problem.

Further to that, the thing I'd be expecting to see if the problem is
locking is that the connection that's blocked up will be waiting on a
lock held by another connection.

It's near certain that you'll find that the connection doing the work
will have LOTS of locks outstanding; that's not a problem at all;
that's perfectly normal.  You need to look keenly for locks that have
not yet been granted.
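
On 9.6 or later, a quick way to spot them (just a sketch):

select pid,
       pg_blocking_pids(pid) as blocked_by,
       state,
       query
  from pg_stat_activity
 where wait_event_type = 'Lock';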

b) I find it common in my environments to need to do manual ANALYZE
requests all the time because I'll set up temporary tables (that the
autovacuum daemon can't do anything about) which, as the stats are
lacking, will lead to awful query plans that make queries run badly.

If you use temporary tables, that's like a heavy "thumb on the scale"
that can lead to awful performance, unless those tables get an ANALYZE
after getting populated.

Unpredictable slowness can certainly result from tables having changed
size leading to pathological query plans.  ANALYZE will help.



Re: VM Instance to Google Cloud SQL Migration

2018-11-15 Thread Christopher Browne
On Thu, 15 Nov 2018 at 07:06, Ian Lawrence Barwick  wrote:
>
> On Thu, 15 Nov 2018 at 17:19, Andreas Kretschmer wrote:
> >
> > Am 15.11.2018 um 08:54 schrieb Sathish Kumar:
> > > We would like to migrate our Postgresql VM instance on Google Cloud
> > > Platform to Google Cloud SQL with a minimal downtime. As I checked, we
> > > have to export and import the SQL file and our database size is large
> > > and cannot afford longer downtime.
> > >
> > > Do any have solution to achieve this?.
> >
> > setup a replication from one to the other?
>
> Doesn't seem possible at the moment; here:
> https://cloud.google.com/sql/docs/postgres/replication/
> it says: "Cloud SQL for PostgreSQL does not yet support replication
> from an external
> master or external replicas for Cloud SQL instances."
>
> Looking at the feature list:
>
>   https://cloud.google.com/sql/docs/postgres/features
>
> among the "Unsupported features" are: "Any features that require
> SUPERUSER privileges"
> (apart from a limited number of extensions), which pretty much rules
> out pglogical or similar solutions.

That usually also rules out Slony-I, although there's a possibility...

Slony-I includes a feature called log shipping, which could perhaps be used
for this, assuming that the "source" environment does allow superuser
privileges.  (And I think you're running on a PostgreSQL instance where
that's possible...)

See: http://slony.info/documentation/logshipping.html

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Please let me know which configuration setting we need to modify to speedup the pg_dump backup.

2018-10-10 Thread Christopher Browne
On Wed, Oct 10, 2018, 10:32 PM Raghavendra Rao J S V <
raghavendra...@gmail.com> wrote:

> Hi All,
>
> pg_dump is taking more time. Please let me know which configuration
> setting we need to modify to speedup the pg_dump backup.We are using 9.2
> version on Centos Box.
>

There certainly isn't a configuration parameter to say "make PG dump
faster."

- It is possible that it is taking longer to backup the database because
the database has grown in size.  If you have a lot more data captured, this
would be a natural consequence, that you need simply to accept.

But there are additional possibilities...

- Perhaps the database is bloated because an update pattern is leading to a
lot of dead data being left behind.  In that case some tables need to be
vacuumed much more often and you should look into the auto vacuum
configuration.

- Perhaps the database has some table that contains a lot of obsolete
data.  This would depend heavily on the nature of your application.

You should look to see what data you are collecting that is not of ongoing
value.  That may represent data that you should trim out of the database.
That should improve the amount of time it takes to do a backup of the
database.
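
A quick way to see where the bulk is (a sketch):

select relname,
       pg_size_pretty(pg_total_relation_size(oid)) as total_size
  from pg_class
 where relkind = 'r'
 order by pg_total_relation_size(oid) desc
 limit 10;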



Re: COPY threads

2018-10-10 Thread Christopher Browne
On Wed, 10 Oct 2018 at 16:22, Ravi Krishna  wrote:
> You obviously are referring to multiple connections running COPY on different 
> tables, right?  Like what pg_restore does with -j option.
> Doesn't copy take an exclusive lock on the table which makes it incompatible 
> with parallelization.

No, why would that seem to be the case?  If it did so, then you could
not run pg_dump to dump data while regular activity was going on.

That's decidedly not the case.

The challenge in parallelizing a dump via COPY TO is in ensuring that
the multiple requests are attached to the same serializable
transaction.  There's a function now that allows multiple connections
to attach to the same transaction context, I believe...  Also, there's
the challenge in actually splitting the data, so that both requests
are dumping different data; that might be fairly expensive whether
with or without indices.
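
If memory serves, the pieces are pg_export_snapshot() plus SET TRANSACTION
SNAPSHOT (9.2+), which is essentially what pg_dump -j leans on; a sketch,
with a deliberately naive way of splitting the data:

-- in the first connection:
begin transaction isolation level repeatable read;
select pg_export_snapshot();   -- returns an id such as '00000003-0000001B-1'

-- in each additional connection, before any query:
begin transaction isolation level repeatable read;
set transaction snapshot '00000003-0000001B-1';
copy (select * from big_table where id % 4 = 0) to stdout;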

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: How can I get and handle the status of sql statements that run in plpgsql ?

2018-10-02 Thread Christopher Browne
On Tue, 2 Oct 2018 at 16:48, David Gauthier  wrote:
>
> Hi:
> psql (9.6.7, server 9.5.2) on linux
>
> How does one get the status of an sql statement executed in plpgsql?  If that 
> status is cryptic, how can that be translated to something which someone 
> could understand?   Finally, how can I effectively do a start transaction and 
> either rollback or commit based on the results of the sql statements run?

For the situations you describe, I suggest that it is simple enough to
embrace the fact that stored functions run inside a pre-existing
transaction context.

You do not need to rollback or to return errors; you merely need to
raise the exceptions.

If *any* piece of the logic encountered an exception, then the
transaction has fallen into an exception state, and will automatically
be rolled back.

You can "lean on this"; either:
a) All of the logic passed with flying colours, and the transaction
may happily proceed, or
b) If any problem comes up along the way, the transaction is cancelled.

In PostgreSQL 11 (not yet a production release), there now exist
stored procedures that allow having BEGIN/COMMIT logic within a
procedure:
https://www.postgresql.org/docs/11/static/sql-createprocedure.html

I would think it likely that you'd want to use a mixture of stored
functions, that do some work on the assumption that it will all either
succeed or fail, and then use a stored procedure to do transactional
control on top of that.
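
A hedged sketch of what that PG11 layer might look like (invented tables;
note that COMMIT is not allowed inside a block that has an EXCEPTION clause):

create procedure nightly_maintenance()
language plpgsql as $$
begin
   update ledger set archived = true
    where posted_on < now() - interval '1 year';
   commit;   -- the first unit of work stays durable even if the next one fails

   delete from staging_rows
    where loaded_at < now() - interval '1 month';
end
$$;
-- call nightly_maintenance();   (CALL must run outside an explicit transaction block)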

But as things stand today, the transaction control will need to take
place in whatever layer you are using to control things.  So, if
you're using an app written in Python to control things, you'd submit
the BEGIN/COMMIT on the Python side, and the stored functions run
within transaction context.  And next year, when PG11 is available,
perhaps the BEGIN/COMMIT could reside in a stored procedure, so that
the Python code gets a bit simpler.  (Or s/Python/Java/g, or
s/Python/PHP/g as needed...)
-- 

When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Weird procedure question

2018-09-25 Thread Christopher Browne
On Tue, Sep 25, 2018, 2:19 AM digimer  wrote:

> Oh, this is a very interesting approach! I didn't realize any UUIDs
> could be created in a predictable way. Thank you, this might be what I need.
>
Yep, DCE defined about 5 different versions of UUIDs, each with somewhat
differing characteristics.

https://en.m.wikipedia.org/wiki/Universally_unique_identifier

Versions 3 and 5 generate repeatable values, which is possibly what you are
after.
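
With the uuid-ossp extension, for instance, the same inputs always produce
the same v5 value:

create extension if not exists "uuid-ossp";
select uuid_generate_v5(uuid_ns_url(), 'https://example.com/widgets/42');
-- running it again yields the identical UUID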

1 and 2 used timestamp info plus node.  At one time MAC addresses were used
as node info, but that seems to have fallen into disrepute.  (I think
because it leaks network identifier info)

It's worth struggling through understanding them; the variations certainly
are useful.

I think I have a function around to generate time-based UUIDs based on
giving the timestamp as parameter; I'll see if I can dig that out.  That
would allow writing up old history with UUIDs that look old.  Bug me off
list if that sounds interesting.


Re: Multiple COPY on the same table

2018-08-21 Thread Christopher Browne
On Mon, 20 Aug 2018 at 16:23, Adrian Klaver  wrote:
>
> On 08/20/2018 08:56 AM, Nicolas Paris wrote:
> >> Can I split a large file into multiple files and then run copy using
> >> each file.
> >
> > AFAIK, copy command locks the table[1] while there is no mention of this
> > in the documentation[2].
>
> [1] Is from Postgres 7.1(17 years ago). I suspect the conditions have
> changed at least a little:).

oxrsdb-generated@localhost->  create temp table foo ();
CREATE TABLE
oxrsdb-generated@localhost->  begin;
BEGIN
oxrsdb-generated@localhost-> * \copy foo from '/dev/null';
COPY 0
oxrsdb-generated@localhost-> * select oid, relname from pg_class where relname = 'foo';
  oid   | relname
--------+---------
 350686 | foo
(1 row)

oxrsdb-generated@localhost-> * select * from pg_locks ;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |       mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
 relation   |   345742 |    11713 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     3455 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2663 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2662 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2685 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2684 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2615 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     1259 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |   350686 |      |       |            |               |         |       |          | 2/200573           | 16754 | RowExclusiveLock | t       | t
 virtualxid |          |          |      |       | 2/200573   |               |         |       |          | 2/200573           | 16754 | ExclusiveLock    | t       | t
(10 rows)

Table 'foo' has a RowExclusiveLock taken out as a consequence of
running COPY against it.

But that does not prevent other connections from concurrently writing
to the table.

Not all locks block other locks...
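
As a concrete sketch (table and file names are made up), two sessions
can load the same table at the same time, because RowExclusiveLock
does not conflict with itself:

-- session 1:
\copy big_table from 'part_01.csv' with (format csv)

-- session 2, concurrently:
\copy big_table from 'part_02.csv' with (format csv)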
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: [External] Multiple COPY on the same table

2018-08-20 Thread Christopher Browne
On Mon, 20 Aug 2018 at 12:53, Ravi Krishna  wrote:

> > What is the goal you are trying to achieve here.
> > To make pgdump/restore faster?
> > To make replication faster?
> > To make backup faster ?
>
> None of the above.
>
>  We got csv files from external vendor which are 880GB in total size, in 44 
> files.  Some of the large tables had COPY running for several hours. I was 
> just thinking of a faster way to load.


Seems like #4...

#4 - To Make Recovery faster

Using COPY pretty much *is* the "faster way to load"...

The main thing you should consider doing to make it faster is to drop
indexes and foreign keys from the tables, and recreate them
afterwards.
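
A hedged sketch, with made-up object names:

ALTER TABLE big_table DROP CONSTRAINT big_table_other_id_fkey;
DROP INDEX big_table_other_id_idx;

\copy big_table from 'big_table.csv' with (format csv)

CREATE INDEX big_table_other_id_idx ON big_table (other_id);
ALTER TABLE big_table
  ADD CONSTRAINT big_table_other_id_fkey
  FOREIGN KEY (other_id) REFERENCES other_table (id);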
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Postgres - search for value throughout many tables?

2018-08-08 Thread Christopher Browne
On Wed, 8 Aug 2018 at 10:14, David G. Johnston
 wrote:
>
> On Wednesday, August 8, 2018, czezz  wrote:
>>
>> Hi everyone,
>> I want to ask if anyone knows is there a way to search for specific "value" 
>> throughout list of tables OR all tables in database?
>
>
> Can you pg_dump your database to plain text and search that?  Nothing built 
> in provides that ability though you possibly could work up something using 
> dynamic sql.

If some of the tables are extraordinarily large and would not be good
candidates, dumping the whole database could be excessively expensive.

If you can identify a specific set of tables that are good candidates,
then a faster option might involve:
  pg_dump --data-only --table=this_table --table=that_table \
      --table=other_table databaseURI

or, if there are only a few tables to omit...
  pg_dump --data-only --exclude-table=this_irrelevant_big_table \
      --exclude-table=another_big_irrelevant_table databaseURI
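
As for the "dynamic sql" idea mentioned above, here is a hedged sketch
of one way it could look; the function name and filters are purely
illustrative, and it will be painfully slow on large tables:

CREATE OR REPLACE FUNCTION search_all_tables(p_value text)
RETURNS TABLE(t_schema text, t_name text, hit_count bigint) AS $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT table_schema AS sch, table_name AS tbl
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')
    LOOP
        -- cast each row to text and look for the value anywhere in it
        RETURN QUERY EXECUTE format(
            'SELECT %L::text, %L::text, count(*) FROM %I.%I t WHERE t::text LIKE %L',
            r.sch, r.tbl, r.sch, r.tbl, '%' || p_value || '%');
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- SELECT * FROM search_all_tables('needle') WHERE hit_count > 0;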

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: User documentation vs Official Docs

2018-07-16 Thread Christopher Browne
On Mon, 16 Jul 2018 at 20:14, Joshua D. Drake  wrote:
>
> On 07/16/2018 05:08 PM, Alvaro Herrera wrote:
> >
> > Sounds like wiki pages could solve this need pretty conveniently.  If
> > and when the content is mature enough and migrates to the tutorial main
> > documentation pages, the wiki pages can be replaced with redirects to
> > those.
>
> Anyone who writes a lot is going to rebel against using a wiki. They are
> one of the worst to write in from a productivity perspective. I would
> rather write in Docbook, at least then I can template everything and we
> could have a standard xsl sheet etc...

Indeed.

I think it would be a fine idea to have some proposals for improved examples
for the tutorial pages.  By putting them there, it becomes easy to reference
material either in reference sections or in the manual pages on SQL commands.
(As "for instances."  It would also be nice to have examples that make reference
to executable programs, whether pg_dump, pg_ctl, or ...)

I'd be willing to help write something of the sort; if I'm to shoot my mouth
off on what we ought to do, best to volunteer to actually make some of it
happen.

Having some small sample applications that do interesting things with
different Postgres facilities seems like a neat approach.

It would be interesting to have an example that makes decent use of
LISTEN/NOTIFY; people keep asking how to have PostgreSQL send
email, and writing a small example that handles it via queueing
requests into a table, where a daemon stops in to send queued
email once in a while.
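
For instance, a small sketch of the queue side of such an example (all
names made up; the delivery daemon itself lives outside the database):

CREATE TABLE email_queue (
    id        bigserial   PRIMARY KEY,
    recipient text        NOT NULL,
    subject   text        NOT NULL,
    body      text        NOT NULL,
    queued_at timestamptz NOT NULL DEFAULT now(),
    sent_at   timestamptz
);

CREATE FUNCTION email_queue_notify() RETURNS trigger AS $$
BEGIN
    NOTIFY email_outbound;   -- nudge the daemon, if it happens to be LISTENing
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER email_queue_notify
AFTER INSERT ON email_queue
FOR EACH ROW EXECUTE PROCEDURE email_queue_notify();

-- the daemon does LISTEN email_outbound; when woken (or on a timer):
--   SELECT id, recipient, subject, body FROM email_queue WHERE sent_at IS NULL;
--   ...sends the mail, then: UPDATE email_queue SET sent_at = now() WHERE id = ...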

Another idea would be an app that captures messages into tables
and enables full text search.

Doing some fairly simple partitioning of a largish (but simple, so it
fits into the docs) data set would help motivate use of the
partitioning facilities.

Applications need to be kept fairly tiny so that they represent
good examples without being of dominant size.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Building a notification system.

2018-07-15 Thread Christopher Browne
On Sun, Jul 15, 2018, 5:30 AM Anto Aravinth, 
wrote:

> Hello Everyone,
>
>
> I'm playing around with postgresql with SO datasets. In the process, I
> have dumped 60M questions data onto the postgresql. I'm trying to build a
> notification system on top of this, so that, when a user edits a question,
> I need to show a notification to the user when he/she logs in next time. So
> literally, trying to create the clone of SO (with very minimal feature)
>
> I'm not sure, how to get started with this. Read about NOTIFY:
> https://www.postgresql.org/docs/current/static/sql-notify.html
>
>
> Not sure that fits my use case, thanks for your help in this.
>


I do not think that the NOTIFY command implemented in PostgreSQL is
terribly likely to be useful for your application.

That command is useful for distribution of notifications to applications
that are continuously connected to the database, which is not likely true
for web app connections, particularly in view of your comment about
notifying users "when they log in next time."

Instead, you need a table that captures a log of undelivered notifications
of changes to questions.  It should capture useful attributes such as:
- Who made the change
- Who is to be notified
- The time of the change
- Perhaps the nature of the change, which could be pretty open ended
- A reference to the question, e.g. its ID
- Some lifecycle attribute such as "viewed-on" or "acknowledged-on"

When a user logs in, it should be easy to query that table, providing the
list of unexamined updates.
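
A minimal sketch of such a table and the login-time query; the names,
and the question and app_user tables it references, are assumptions:

CREATE TABLE question_notification (
    id              bigserial   PRIMARY KEY,
    question_id     bigint      NOT NULL REFERENCES question (id),
    changed_by      bigint      NOT NULL REFERENCES app_user (id),
    notify_user     bigint      NOT NULL REFERENCES app_user (id),
    changed_at      timestamptz NOT NULL DEFAULT now(),
    change_summary  text,
    acknowledged_at timestamptz
);

-- "what's new for me?" at login time:
SELECT question_id, changed_by, changed_at, change_summary
  FROM question_notification
 WHERE notify_user = $1
   AND acknowledged_at IS NULL
 ORDER BY changed_at;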

Perhaps entries may be removed as soon as they are viewed, or you may need
a more sophisticated lifecycle so they are only removed after some express
indication that the change has been fully acknowledged.


Re: Open Source tool to deploy/promote PostgreSQL DDL

2018-07-11 Thread Christopher Browne
On Wed, 11 Jul 2018 at 16:37, Ron  wrote:
>
> On 07/11/2018 03:21 PM, Christopher Browne wrote:
> > I have built one that I call Mahout
> > (https://github.com/cbbrowne/mahout) which has the merit of involving
> > just two shell scripts, one of which is an auditing tool (pgcmp).
> >
> > It implements a "little language" to indicate dependencies between the
> > SQL scripts that implement the DDL changes.
> >
> > The notable thing that I have seen "downthread" that it does not
> > attempt to implement is "rollback scripts."  I find that "necessity"
> > to be a ruby-on-rails dogma that does not seem to fit what I see
> > people doing.
>
> Where I work, the requirement to have rollback scripts is part of the ITIL
> requirement for Changes to have a backout procedure.

ITIL surely does NOT specify the use of database rollback scripts as
THE SPECIFIED MECHANISM for a backout procedure.

In practice, we tend to take database snapshots using filesystem
tools, as that represents a backout procedure that will work regardless
of the complexity of an upgrade.

It is quite possible for an upgrade script to not be reversible.

After all, not all matrices are invertible; there are a surprisingly large
number of preconditions that are required for that in linear algebra.

And in databases, not all upgrades may be reversed via rollback scripts.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: Open Source tool to deploy/promote PostgreSQL DDL

2018-07-11 Thread Christopher Browne
I have built one that I call Mahout
(https://github.com/cbbrowne/mahout) which has the merit of involving
just two shell scripts, one of which is an auditing tool (pgcmp).

It implements a "little language" to indicate dependencies between the
SQL scripts that implement the DDL changes.

The notable thing that I have seen "downthread" that it does not
attempt to implement is "rollback scripts."  I find that "necessity"
to be a ruby-on-rails dogma that does not seem to fit what I see
people doing.



Re: large numbers of inserts out of memory strategy

2017-11-30 Thread Christopher Browne
On 30 November 2017 at 05:22, Peter J. Holzer  wrote:
> On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> Yes I did generate 1 large DO block:
>>
>> DO $$
>> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> BEGIN
>> INSERT INTO thing
>> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> INTO thingid,thingrec;
>> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> INSERT INTO thingstatus
>> (thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> VALUES 
>> (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
>> INSERT INTO thinger
>> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
>> VALUES 
>> (thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.1000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.4881907e-10);
>> INSERT INTO thingdata
>> (thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
>> VALUES 
>> (thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
>> RETURNING id INTO thingdataid;
>> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
>> (thingdataid,'013086',0,1502970401,'FOO');
>>
>> 
>>
>> END $$;
>>
>> Should I limit the number of 'thing' inserts within a DO block or
>> wrapping each 'thing' insert in it's own DO block?
>
> I would suggest getting rid of the do block entirely if that is
> possible. Just create lots of insert statements. You can get the current
> value of a sequence with currval('sequence_name').
>
> Alternately or in addition, since you are using python, you might want
> to insert directly into the database from python using psycopg2. For
> separate insert statements that should have about the same performance.
> (It is usually much faster to write to a csv file and load that with
> copy than to insert each row, but you don't do that and it might be
> difficult in your case).

A long time ago (pre version 2.2), Slony-I used to do this, having a big
stream of INSERT, UPDATE, and DELETE statements.

This seemed pretty OK when tuples weren't very large, but we did get
some reports in from people with applications with very large tuples
that would run into out-of-memory conditions if a bunch of INSERTs
involving very large tuples were processed together.

The "fix" we initially did was to have the bulk processing stream
hold off on any Very Large Tuples, so that when we'd encounter
them, we'd process the big tuples one by one.  This worked
relatively OK, but meant that we'd be switching logic every time
there was big data, and there was a pretty painful amount of
configuration to force people to worry about.

The "2.2" change was to switch to COPY-based streaming.  In
our case, we put a trigger onto the single log table and have that
initiate tossing data as it came in into the appropriate target table.

That introduced an Amount Of Complexity (e.g. - a rather complex
stored function in C/SPI), but it's notable that we got a pretty big
performance boost as well as complete elimination of memory
allocation worries out of turning the process into COPY streaming.

It may be challenging to get a suitable set of COPY requests to do
what is necessary.  (It sure would be cool if it could just be one...)
But there are three crucial things I'd observe:
a) Loading data via COPY is *way* faster,
b) Parsing INSERT statements is a *big* slowdown,
c) Parsing INSERT statements means that those statements must
be drawn into memory, and that chews a lot of memory if the query
has very large attributes.  (COPY doesn't chew that memory!)
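
As a rough illustration of points (a) and (c), using the tables from
the quoted DO block (the file names are made up); the wrinkle, as
noted above, is that COPY cannot hand back generated ids the way
RETURNING does, so cross-table references have to be resolved before
or after the load:

\copy thing (ltn, classification, machine, source, thgrec, flags, serial, type) from 'thing.csv' with (format csv)
\copy recnum_thing (recnum, thing_id) from 'recnum_thing.csv' with (format csv)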

At one point, Jan Wieck was looking into an alternative COPY
protocol that would have allowed more actions, notably:

- You could specify multiple tables to stream into, and switch
  between them on a tuple by tuple basis.

- You could specify actions of INSERT, UPDATE, or DELETE.

It seemed like a really interesting idea at the time; it was intended
to be particularly useful for Slony, but some folks designing data
warehouse ETL systems observed that it would be useful to them,
too.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


