Re: [GENERAL] Column Tetris Automatisation

2017-01-15 Thread Mike Sofen
|From: Nicolas Paris
|Hello,
|In PostgreSQL, the order of columns has a non-negligible impact on table
|size[1].
|Tables are in many cases dynamic, and new fields can appear over the
|database's life.
|I suspect re-ordering columns based on their types would be an automatable task.
|
|Moreover, most client code should not depend on column order. A
|TETRIS option in VACUUM FULL would therefore be useful for many users.
|Does that make sense?

For me it doesn't.  As a database architect/engineer, I always create a data
model first (using some software like Xcase, Embarcadero, Erwin), then use
that software to generate the DDL to create the physical database objects,
then start creating stored functions against those objects.

Since my code follows my model, and since I don't use "select *", and since
I prefer to have all of my identifiers at the "top" of a table, order does
matter...it all ties together nicely, making it easier for other developers
to follow an identical pattern across all of the database objects. 

All of that said, the notion of embedding Tetris functionality into a
codebase makes me smile, for some reason...
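
For anyone curious about the size effect Nicolas describes, here is a minimal
sketch (the alignment-padding behavior is standard Postgres; the table and
column names are invented):

-- Each int4 that sits immediately before an int8 forces 4 bytes of
-- alignment padding, widening the fixed-width portion of every row.
CREATE TABLE padded (
    a int4,   -- 4 bytes + 4 bytes of padding before the following int8
    b int8,
    c int4,   -- 4 bytes + 4 bytes of padding
    d int8
);            -- 32 bytes of column data per row

-- Same columns, widest first: no padding between them.
CREATE TABLE packed (
    b int8,
    d int8,
    a int4,
    c int4
);            -- 24 bytes of column data per row

-- After loading identical data into both tables, compare:
-- SELECT pg_total_relation_size('padded'), pg_total_relation_size('packed');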

Mike Sofen





Re: [GENERAL] LYDB: What advice about stored procedures and other server side code?

2016-12-28 Thread Mike Sofen
|From: Christoph Moench-Tegeder
|Initially, running code in your database can make life easier for the developers
|(use pgTap for testing, pl/profiler and pl/debugger, etc.). But once you have to
|change your schema, the hurt begins:
|you'll need downtime for that, or you'll have to deal with the possibility of
|changing the API of your "database side" code, and matching code and tables on
|the database.

I look at this from the opposite direction: with a stable database API (via 
stored procs), I can change the schema and logic within the procs without 
causing any app code breakage…the app tier is completely insulated from those 
changes – that’s worth a lot.  Yes, for deploying the schema change there must 
be an outage, but that’s true regardless of where the data is being manipulated 
– the data is still in a db…and there are ways to mitigate/eliminate the 
duration of the outage.
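
To make that concrete, a minimal, hypothetical sketch (every name here is
invented): the signature the app tier calls never changes, only the proc body
does.

-- Version 1: the proc reads a single flat table.
CREATE OR REPLACE FUNCTION api.get_order(p_order_id bigint)
RETURNS TABLE (order_id bigint, customer_name text, total numeric)
LANGUAGE sql STABLE AS $$
    SELECT o.order_id, o.customer_name, o.total
    FROM app.orders o
    WHERE o.order_id = p_order_id;
$$;

-- Later, orders is split into orders + customers.  Only the body changes;
-- the app still calls api.get_order(id) exactly as before.
CREATE OR REPLACE FUNCTION api.get_order(p_order_id bigint)
RETURNS TABLE (order_id bigint, customer_name text, total numeric)
LANGUAGE sql STABLE AS $$
    SELECT o.order_id, c.customer_name, o.total
    FROM app.orders o
    JOIN app.customers c ON c.customer_id = o.customer_id
    WHERE o.order_id = p_order_id;
$$;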

|The next pain point is scalability: running code on the database server puts your
|code on the most expensive and hardest to scale CPUs. You can (almost) always
|add another appserver to your setup (just spin up a VM with a tomcat or
|whatever-you-use). But if the bottleneck is your database CPUs, you'd have to
|move to a larger server.

Our dedicated db servers have not yet shown any real CPU consumption during app 
use - memory, for us, is the only real limiting factor.  The only time CPU 
consumption spikes is during admin activities - reindexing, vacuuming, bulk 
data loads...that sort of thing.  Even the boxplot calculations barely cause a 
ripple.  To me that speaks to the efficiency of language and engine working 
together.  You are right of course on the scaling - if we do run out of CPU 
horsepower and cannot scale up any more, we'd have to scale out, and there are 
ways to do that too.  IOWs, the model doesn't have to change, just the solution 
to solve the scaling (solve the problem, not the symptom).

|TL;DR: database side code can be a great thing in a small application, but once
|the application and traffic grows, "code in the database"
|requires specialist attention and may become a burden.
|Unfortunately, most large applications started small...

In my opinion, having a database specialist work on database stuff is a GOOD 
thing.  Tables get designed properly, correct indexes are built, efficient 
query plans are created, etc.  ORMs are a shortcut to getting an app talking to 
data, but aren't a substitute for a proper, scalable data tier.  IMO...being a 
data specialist...  :-)

Mike Sofen (Synthetic Genomics)





Re: [GENERAL] Performance PLV8 vs PLPGSQL

2016-12-28 Thread Mike Sofen
From: Tim Uckun
I have seen various links on the internet which indicate that PLV8 is 
significantly faster than PL/pgSQL, sometimes an order of magnitude faster. 

 

Is there any benefit to choosing PL/pgSQL?



I can’t speak to PLV8.  However, I can speak to plpgsql, and specifically 
stored functions (procs).  I use it exclusively to create a database API for 
real-time web applications to hit.  My API calls (procs) are hitting large 
tables, sometimes doing complex logic within the sproc.  It allows me to 
provide a simple, standardized interface to the web devs, allowing them to 
focus on the app code work.

 

Performance is superb and continues to surprise me (I came from the SQL Server 
world).  As others have mentioned, the natural lashup of plpgsql to postgres (I 
liked Alban’s term, “impedance”) is a key aspect.  Also:

 

- Stored procs provide another security layer against sql injection attacks.

- Caching SEEMS to be more efficient/effective with stored procs (that could be 
wishful thinking too).

- Stored procs allow skilled sql practitioners to provide far more sophisticated 
sql solutions than the typical python developer is capable of…my experience is 
that most web devs don’t really understand databases (or even care about them – 
they are a necessary evil), so providing a pure encapsulated sql solution (via 
stored procs) removes that mental impedance mismatch.

- Performance?  Simple “get” procs that return data for a specific indexed query 
against larger tables (50m+ rows) in a few milliseconds…I can live with that 
kind of performance.  (A sketch of such a proc follows below.)

- I’m also doing some heavy lifting in the sql, calculating histograms and 
boxplots for data visualizations.  This is an unusual scenario, but the other 
option is sending a massive chunk of data to another server for processing – 
just the transit time would kill the deal.  I am mindful that at a certain 
point, there won’t be enough memory and i/o to go around, but the web app is a 
low user count/high user task complexity app, so I’ve tailored the model to 
match.
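
A minimal sketch of the kind of "get" proc described above (all names are
invented): the caller's values arrive as typed parameters that are bound as
data, never spliced into SQL text, and the indexed lookup is what keeps the
call in the low-millisecond range.

CREATE OR REPLACE FUNCTION api.get_sample_readings(
    p_sample_id bigint,
    p_limit     int DEFAULT 100)
RETURNS TABLE (reading_id bigint, taken_at timestamptz, value numeric)
LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT r.reading_id, r.taken_at, r.value
    FROM lab.readings r
    WHERE r.sample_id = p_sample_id     -- assumes an index on readings(sample_id)
    ORDER BY r.taken_at DESC
    LIMIT p_limit;
END;
$$;

-- The app tier calls it with bound parameters, e.g.:
-- SELECT * FROM api.get_sample_readings(12345, 50);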

 

Mike Sofen  (Synthetic Genomics)



Re: [GENERAL] Syntax error needs fresh eyeballs

2016-12-26 Thread Mike Sofen
From: Rich Shepard
   My schema includes three tables and psql throws an error I'm not seeing
when I try to read the schema into the database. I'd appreciate fresh eyes
looking at the table and learning what error I'm not seeing.

 

   The tables:

CREATE TABLE Weather (
   site_id INTEGER PRIMARY KEY,
   site_name TEXT,
   site_location TEXT
);

CREATE TABLE Weather_Params (
   site_id INTEGER REFERENCES Weather(site_id),
   param TEXT,
   param_unit TEXT,
   freq INTEGER,
   freq_unit TEXT,
   equip TEXT,
   PRIMARY KEY (site_id, param)
);

CREATE TABLE Weather_Data (
   site_id INTEGER REFERENCES Weather(site_id),
   monit_date DATE,
   monit_time TIME,
   read_by TEXT,  -- name of employee collecting data
   param TEXT REFERENCES Weather_Params(param),
   param_value REAL,
   PRIMARY KEY (site_id, monit_date, monit_time, param)
);

 

   The error:  ERROR:  there is no unique constraint matching given keys for
referenced table "weather_params".

---

The error message is actually pointing at the foreign key on param:
Weather_Data references Weather_Params(param), but param by itself has no
unique constraint in Weather_Params - it is only part of the composite primary
key (site_id, param).  Reference the whole key instead, with a table-level
FOREIGN KEY (site_id, param) REFERENCES Weather_Params (site_id, param), or
give param its own unique constraint.  A side note: the camel-cased table names
aren't double-quoted, so Postgres folds them to lower case - harmless as long
as you're consistent, but either lower-case everything or double-quote all
identifiers.
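
A sketch of the last table rewritten that way (lower-cased, with the composite
foreign key):

CREATE TABLE weather_data (
   site_id INTEGER REFERENCES weather(site_id),
   monit_date DATE,
   monit_time TIME,
   read_by TEXT,  -- name of employee collecting data
   param TEXT,
   param_value REAL,
   PRIMARY KEY (site_id, monit_date, monit_time, param),
   FOREIGN KEY (site_id, param) REFERENCES weather_params (site_id, param)
);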

 

Mike

 



Re: [GENERAL] Storing files: 2.3TBytes, 17M file count

2016-11-28 Thread Mike Sofen
From: Thomas Güttler   Sent: Monday, November 28, 2016 6:28 AM



...I have 2.3TBytes of files. File count is 17M

Since we already store our structured data in postgres, I think about storing 
the files in PostgreSQL, too.

Is it feasible to store file in PostgreSQL?

---

I am doing something similar, but in reverse.  The legacy mysql databases I’m 
converting into a modern Postgres data model have very large genomic strings 
stored in 3 separate columns.  Out of the 25 TB of legacy data storage (in 800 
dbs across 4 servers, about 22b rows), those 3 columns consume 90% of the total 
space, and they are just used for reference, never used in searches or 
calculations.  They range from 1k to several MB.

 

Since I am collapsing all 800 dbs into a single PG db, being very smart about 
storage was critical.  Since we’re also migrating everything to AWS, we’re 
placing those 3 strings (per row) into a single json document and storing the 
document in S3 bins, with the pointer to the file being the globally unique PK 
for the row…super simple.  The app tier knows to fetch the data from the db and 
large string json from the S3 bins.  The retrieval time is surprisingly fast, 
this is all real time web app stuff.
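
A hypothetical sketch of that shape (names invented): the row keeps a pointer,
not the payload.

CREATE TABLE genomics.sequence_record (
    record_id   bigserial PRIMARY KEY,    -- globally unique PK, reused as the S3 object key
    sample_code text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    payload_uri text NOT NULL              -- e.g. 's3://some-bucket/sequence_record/<record_id>.json'
    -- the three large genomic strings live in that JSON document, not in the row
);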

 

This is a model that could work for anyone dealing with large objects (text or 
binary).  The nice part is, the original 25TB of data storage drops to 5TB – a 
much more manageable number, allowing for significant growth, which is on the 
horizon.

 

Mike Sofen  (Synthetic Genomics USA)



Re: [GENERAL] initdb createuser commands

2016-10-30 Thread Mike Sofen
From: Samuel Williams  Sent: Sunday, October 30, 2016 3:42 PM
As a community I'd think that having feedback from a new user would be valuable 
since as you say, sometimes when you get ingrained into the "way of doing 
things" that you don't see how they could be improved or different.

Samuel

   

I’d take a different tack.  I spent 20 years with SQL Server and easily (almost 
gleefully) hopped over to Postgres and especially plpgsql and PgAdmin III, from 
using SQL Server Management Studio (SSMS – their admin/coding app).

 

Sure, I had to learn the PG way of doing things, but really, it was a 
no-brainer.  I had to spend a few extra cycles learning the PG best practices 
and particular way of doing things but it was trivial…google and done.  The 
vast community has created massive amounts of examples for nearly everything 
imaginable – and some things I would never have imagined anyone would try to do 
– such that I don’t have to Lewis and Clark it but just dive right in and write 
code.

 

IMO, nothing major needs changing in the language or command syntax – it’s 
logical and easy for anyone skilled in sql.  If someone isn’t skilled in sql, 
the requests you’ve made won’t assist them at all.

 

Mike Sofen (Synthetic Genomics)



Re: [GENERAL] Immutable datastore library?

2016-10-18 Thread Mike Sofen
From: Guyren Howe   Sent: Monday, October 17, 2016 1:40 PM
I would like to use Postgres as an immutable data store. A subject table
would have a timestamp column, and that would be added to what would
otherwise be the primary key.  Trap updates and turn them into inserts. Have
an associated _deleted table. Trap deletes and turn them into inserts of the
primary key into that table.  Create a view that only shows the non-deleted
records with the most recent timestamps.

Stored procedure to do all that to a table. Event trigger to drop and
re-create the view on changes to the table columns.
--
A couple years ago at another firm, I designed and built a real time ODS
(operational data store) for a large healthcare company, on SQL Server 2012
that supported your exact requirements, since the ODS was the primary data
feed for their data warehouse.

My solution leveraged the T-SQL MERGE statement (the equivalent of an "upsert",
i.e. INSERT ... ON CONFLICT, in PG) to detect inserts/updates/deletes.  We
don't allow physical row deletes in medical data, so these were logical
deletes, aka an update to an "InactivatedDatetime" column making it not null.
I used a checksum function in the Update branch to detect whether the inbound
data had changed at all, to avoid creating dry updates (no change in the data,
but a new Update row would be written otherwise).

Ok that's the internals for the write to the ODS.  I wrapped the entire
Merge statement inside of another insert statement using the equivalent of
the PG "Into" function, which took every column from the ODS write and wrote
the same data to a "History" table of the same name - those writes were
always inserts, creating a persistent, complete picture of every write to
the ODS.  Each row going into the History tables was marked with a "D"
(delete), "I" (insert) or "U" (update).  The History data was used for both
auditing and for nightly batches feeding the data warehouse, where row type
(D, I, or U) drove their Type 2 processing.  As you can imagine, the table
design was crucial to the success of this model.
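
In Postgres terms (9.5+), a rough, hypothetical sketch of the same pattern: an
upsert that skips dry updates via a row hash, with an AFTER trigger writing the
I/U history rows (a 'D' flag for the logical-delete path could be derived the
same way).  All schema, table and column names here are invented.

CREATE TABLE ods.patient_visit (
    visit_id       bigint PRIMARY KEY,
    payload        jsonb NOT NULL,
    row_hash       text  NOT NULL,        -- checksum used to detect dry updates
    inactivated_at timestamptz            -- logical-delete marker
);

CREATE TABLE ods.patient_visit_history (
    LIKE ods.patient_visit,
    change_type char(1) NOT NULL,          -- 'I' or 'U'
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION ods.patient_visit_audit() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO ods.patient_visit_history
        (visit_id, payload, row_hash, inactivated_at, change_type)
    VALUES
        (NEW.visit_id, NEW.payload, NEW.row_hash, NEW.inactivated_at,
         CASE WHEN TG_OP = 'INSERT' THEN 'I' ELSE 'U' END);
    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_patient_visit_audit
AFTER INSERT OR UPDATE ON ods.patient_visit
FOR EACH ROW EXECUTE PROCEDURE ods.patient_visit_audit();

-- The load itself: update only when the checksum actually changed.
INSERT INTO ods.patient_visit AS t (visit_id, payload, row_hash)
SELECT s.visit_id, s.payload, md5(s.payload::text)
FROM staging.patient_visit s
ON CONFLICT (visit_id) DO UPDATE
SET payload  = EXCLUDED.payload,
    row_hash = EXCLUDED.row_hash
WHERE t.row_hash IS DISTINCT FROM EXCLUDED.row_hash;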

This was ultra-efficient on the real time data flowing in from the hospital
and clinical EMRs (different systems) - in one transaction/one query, I was
able to double-write the data and ensure both writes completed or fail both
and error.  The "batches" were small - up to 100k rows or less, and
processed in under 100ms.  But even when there was a network outage and we
had to do a catch up load with millions of rows, it ran very quickly.  IOWs,
the double write overhead was very modest, especially with modern disk
performance.

Mike Sofen (Synthetic Genomics)





Re: [GENERAL] Graphical entity relation model

2016-09-28 Thread Mike Sofen
From: jotpe   Sent: Wednesday, September 28, 2016 1:17 PM



Does anybody know a Software for generating graphical entity relation models 
from existing postgresql databases?
Best regards Johannes

---

I don’t know any useful free ones, but my favorite less expensive one (and the 
one I’ve used for the last few years) is xcase (http://www.xcase.com/) .  It 
can reverse engineer against most current popular databases including Postgres. 
 The current version has 2 postgres quirks: 1) you cannot (for some bizzare 
reason) declare a json/jsonb column, so I declare it as text and simply adjust 
the generated DDL as needed.  2) you cannot declare an index on a text column 
(as if they are thinking it is akin to a sql server text column).  Other than 
those issues, which can be easily worked around, it is by far the fastest 
modeling tool I’ve ever used (I’ve used the more famous ones), and the fully 
comprehensive, with a very modern looking UI.

 

In contrast, there are the over-priced dinosaurs with old ugly UIs.  A while 
back I reviewed some of the modeling tools, and none did it for me, I went 
ahead and got another license to xcase.

 

Mike Sofen (Synthetic Genomics)



Re: [GENERAL] how to monitor the progress of really large bulk operations?

2016-09-28 Thread Mike Sofen
From: Pavel Stehule  Sent: Tuesday, September 27, 2016 9:18 PM
2016-09-28 6:13 GMT+02:00 Pavel Stehule <pavel.steh...@gmail.com>:

Hi

2016-09-27 23:03 GMT+02:00 Mike Sofen <mso...@runbox.com>:

Hi gang,

how to view the state of a transaction in flight, seeing how many rows have 
been read or inserted (possible for a transaction in flight?), memory 
allocations across the various PG processes, etc.

some years ago I used a trick 
http://okbob.blogspot.cz/2014/09/nice-unix-filter-pv.html#links

 

pltoolbox has counter function 
https://github.com/okbob/pltoolbox/blob/master/utils.c

pavel=# insert into omega2 select (x.xx).* 
   from (select pst.counter(omega,20, true) xx 
from omega
) x;
NOTICE:  processed 20 rows, current value is '(5,8)'
NOTICE:  processed 20 rows, current value is '(5,8)'

Regards

Pavel

 

 

Pavel - That’s a very interesting function, and thanks for sharing your toolbox. 
The big question, of course, is what is the impact on performance, scalability 
and stability?  Would it work inside of a stored function, allowing me to write 
out the progress to a tracking table?
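
For what it's worth, a crude plpgsql sketch of the idea (hypothetical names):
emit a NOTICE every N rows while looping.  One caveat: rows written to a
tracking table from inside the same transaction aren't visible to other
sessions until commit, so NOTICE output (or writing through a separate
connection, e.g. dblink) is the usual way to watch progress in flight.

CREATE OR REPLACE FUNCTION etl.load_with_progress() RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    rec       record;
    processed bigint := 0;
BEGIN
    FOR rec IN SELECT id, payload FROM staging.big_source LOOP
        INSERT INTO target.big_dest (id, payload) VALUES (rec.id, rec.payload);
        processed := processed + 1;
        IF processed % 1000000 = 0 THEN
            RAISE NOTICE '% rows processed at %', processed, clock_timestamp();
        END IF;
    END LOOP;
    RAISE NOTICE 'done: % rows', processed;
END;
$$;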

 

Mike



[GENERAL] how to monitor the progress of really large bulk operations?

2016-09-27 Thread Mike Sofen
Hi gang,

 

On PG 9.5.1, linux, I'm running some large ETL operations, migrate data from
a legacy mysql system into PG, upwards of 250m rows in a transaction (it's
on a big box).  It's always a 2 step operation - extract raw mysql data and
pull it to the target big box into staging tables that match the source, the
second step being read the landed dataset and transform it into the final
formats, linking to newly generated ids, compressing big subsets into jsonb
documents, etc.

 

While I could break it into smaller chunks, it hasn't been necessary, and it
doesn't eliminate my need:  how to view the state of a transaction in
flight, seeing how many rows have been read or inserted (possible for a
transaction in flight?), memory allocations across the various PG processes,
etc.

 

Possible or a hallucination?

 

Mike Sofen (Synthetic Genomics)



Re: [GENERAL] IDE for function/stored proc development.

2016-09-03 Thread Mike Sofen
From: Tim Uckun Sent: Saturday, September 03, 2016 2:37 AM
Does anybody use an IDE for doing heavy duty stored proc development?  PGadmin 
is decent but I am looking for something better.

 

I have been using the Datagrip app (from Jetbrains), from its beta release up 
through now v 2016.2 and love it.  Full autocomplete, it has my object browser 
on the left, source code file browser on the right (I have it tied into our 
git), massively customizable to look and behave any way you want it.  It is not 
effective yet for admin tasks.  I really love the modern interface.  I have 
zero connection to the company, paid for a license ($200).  Worth every penny.

 

I’m a hardcore stored proc/func dev, building database api’s for the large 
systems I design/build…I write a LOT of code and datagrip is where I do it.  I 
also use the crash-prone PgAdminIII  for admin stuff like table mods, quick 
scripting of table def or insert columns, or backups, etc…the normal admin 
stuff.

 

MikeS



Re: [GENERAL] Rackspace to RDS using DMS (Postgres 9.2)

2016-09-01 Thread Mike Sofen
From: Joshua D. Drake  Sent: Thursday, September 01, 2016 9:29 AM
On 08/31/2016 03:41 PM, Patrick B wrote:

> Hi guys, I posted this question on the ADMIN list but will post here
> as well so more people can comment...
> https://www.postgresql.org/message-id/CAJNY3it_AfxJhmwMHtpiAbHG47GS5rJOAUgfHw%2BGm5OXCbUm7w%40mail.gmail.com
>
> I've got a 2.3TB Database running at Rackspace... We'll be migrating
> it to RDS PostgreSQL 9.5 very soon...
>
> We already have an EC2 Instance at Amazon running PostgreSQL 9.2 as
> streaming replication from Rackspace.
>
> I'll have to upgrade the version of Postgres on that instance before
> start using DMS service.
>
> *Question:*
> Has anybody ever used that service? I'm just trying to find out how
> much time it will take to perform the migration...

 

It is 2.3TB, it is going to take a long time no matter what service you are 
running.

 

No, I have not used DMS. Frankly, with all respect to AWS/RDS the idea of 
running a 2.3TB instance that will get any level of performance sounds 
ridiculously expensive.

 

Sincerely,

 

JD

 

I currently have an EC2 instance in AWS – an m4.xlarge (4 cores, 16gb, 3tb 
SSDs) and it’s pretty cheap, about $620/mo ($210/mo for the compute, $410 for 
the storage).  The performance of this setup rivals the in-house Cisco UCS 
server that we are demoing, which costs ~$100k, as long as our batch sizes 
don’t exceed available memory – that’s where the larger Cisco pulls ahead.  The 
$620/mo is the on-demand price, btw…the reserved price is much lower.

 

$100k/ $620 = 161 months of operation before cost parity.

 

Mike S



Re: [GENERAL] UPDATE OR REPLACE?

2016-09-01 Thread Mike Sofen
On Thu, Sep 1, 2016 at 12:10 PM, dandl <da...@andl.org> wrote:
> Sqlite has options to handle an update that causes a duplicate key. Is 
> there anything similar in Postgres?
> This is not an UPSERT. The scenario is an UPDATE that changes some key 
> field so that there is now a duplicate key. In Sqlite this handled as:
> UPDATE OR IGNORE table SET 
> UPDATE OR REPLACE table SET 
>
> And so on
>
> See https://www.sqlite.org/lang_update.html.
>
> Can Postgres do this?

I would propose that this effectively violates referential integrity and 
shouldn't be a valid design pattern.

In my mind primary keys are supposed to be static, stable, non-volatile...aka 
predictable.  It feels like an alien invading my schema, to contemplate such an 
activity.  I hope PG never supports that.

Postgres allows developers incredible freedom to do really crazy things.  That 
doesn't mean that they should.

Mike Sofen (USA)





Re: [GENERAL] UUIDs & Clustered Indexes

2016-08-30 Thread Mike Sofen
From: George Neuner  Sent: Tuesday, August 30, 2016 5:54 PM



>Mike Sofen wrote: So in this scenario, I'm using
>BOTH bigserials as the PK and uuids as AKs in the core tables.  I
>reference the bigints for all joins and (have to) use the uuids for the
>filters.  It's been working ok so far, lookup performance on a table
>with a few million rows, using the uuid (indexed) is instantaneous.
>I'll soon have a 100 million+ rows loaded into a single table and know a bit more.
>
>The uuids are also design insurance for me in case I need to shard,
>since I'll need/want that uniqueness across servers.

FYI:  articles about sharding using bigint keys.

http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/

 

George

 

I remember reading these articles a long time ago, forgot about them...and
appreciate the reminder!  

 

I really liked the enhanced Instagram function from Rob Conery in the second
link, but so far haven't needed to deal with it.  However, an upcoming
project may require huge data storage - approaching hundreds of billions of
rows, and I'm sticking with Postgres - so this will be a great way to test
drive the function.  And I may try my hand at a further enhancement, time
permitting.  Thanks for the links!

 

Mike

 



Re: [GENERAL] UUIDs & Clustered Indexes

2016-08-30 Thread Mike Sofen
From: Tom Lane  Sent: Tuesday, August 30, 2016 7:16 AM



Luke Gordon <gord...@gmail.com> writes:
> However, according to a message on this mailing list, Postgres doesn't
> have clustered indexes:
> "But Postgres doesn't _have_ clustered indexes, so that article
> doesn't apply at all. The other authors appear to have missed this important point."
> https://www.postgresql.org/message-id/56798352.7060902%40uchicago.edu

> But, doing a quick check, it appears Postgres does indeed have a
> mechanism for a clustered index:
> https://www.postgresql.org/docs/9.5/static/sql-cluster.html

CLUSTER just does a one-time sort to put the table into index order.
There is no mechanism that would cause subsequent insertions of new keys to
respect that ordering, so it's pretty much irrelevant to the argument about
whether new UUID keys need to be generated in some ordered fashion.

Do you actually *need* UUID keys, and if so why?  A plain old bigint column
is smaller, cheaper to index, and the natural mechanism for generating it
(ie a sequence) will tend to preserve ordering for free.

        regards, tom lane

 

I agree with Tom for a "normal" application - I would always use bigints
(bigserial) as a PK column.  The app I'm currently working on is a
high-security web app for which the app coders require guids for all
identifiers flowing around the system.  So in this scenario, I'm using BOTH
bigserials as the PK and uuids as AKs in the core tables.  I reference the
bigints for all joins and (have to) use the uuids for the filters.  It's been
working ok so far; lookup performance on a table with a few million rows,
using the uuid (indexed), is instantaneous.  I'll soon have 100 million+ rows
loaded into a single table and will know a bit more.

 

The uuids are also design insurance for me in case I need to shard, since
I'll need/want that uniqueness across servers.
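
The shape of that, as a minimal sketch (hypothetical table; gen_random_uuid()
assumes the pgcrypto extension is installed):

CREATE TABLE core.specimen (
    specimen_id  bigserial PRIMARY KEY,                          -- used for all joins
    specimen_uid uuid NOT NULL UNIQUE DEFAULT gen_random_uuid(), -- alternate key exposed to the app
    label        text NOT NULL
);

-- Joins use the bigint; filters coming from the app use the (indexed) uuid:
-- SELECT ... FROM core.specimen s WHERE s.specimen_uid = $1;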

 

Mike Sofen



Re: [GENERAL] Re: Clustered index to preserve data locality in a multitenant application?

2016-08-30 Thread Mike Sofen
From: Thomas Kellerer Sent: Tuesday, August 30, 2016 4:39 AM
Nicolas Grilly schrieb am 30.08.2016 um 13:12:
> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.

What about partitioning by tenant? With a local index on each partition.

Partitioning is currently a bit limited in Postgres (e.g. you can't have
incoming foreign keys) but this would fit your requirements pretty much as I
understand them.

 

For Nicolas’s situation, that would require 10,000 partitions – not very 
useful, and each partition would be very small.

 

The premise of clustered indexes is that the row data is “in the index”, so no 
row lookups are needed – that’s very effective/performant and clearly works 
well for many scenarios.  

 

In Postgres, as you mentioned, clustering is a “one time” operation but only in 
the sense that after you add more rows, you’ll need to re-cluster the table.  
Depending on the activity model for that table, that may be feasible/ok.  For 
example, if you load it via regular batch scripts, then the clustering could be 
done after those loads.  If you add rows only rarely but then do lots of 
updates, then the clustering would work great.  If this is an active real time 
data table, then clustering would not be viable.
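
For reference, a sketch of that one-time operation (names invented), typically
scheduled right after a batch load:

-- Physically rewrite the table in the order of the chosen index
-- (takes an exclusive lock for the duration of the rewrite):
CLUSTER tenant_data USING tenant_data_tenant_id_created_at_idx;

-- Later loads do not maintain that order, so the command is simply re-run
-- after each batch window; a bare CLUSTER reuses the previously recorded index:
CLUSTER tenant_data;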

 

But you may still be fine without them.  Here’s why:  Postgres is a very 
fast database engine.  I am constantly surprised at how performant it is.  I 
came from the SQL Server world where I always leveraged the automatic clustered 
indexes but in Postgres I am not using them and am running some big data with 
no table partitioning (yet) and performance is still very good.  I carefully 
optimize my data models, so that is part of it.  I also carefully optimize my 
stored functions/indexes.  I am migrating data from 500+ mysql databases/~8,000 
tables/~20 billion rows into a single Postgres db.  As my data grows, I may 
shard it.  The new parallel query support in PG v9.6 may also support your 
table model very nicely, depending on how you query into it.

 

So I’d say, just build a prototype PG db, build that one table, load your 
existing data into it (very easy via the mysql_fdw foreign data wrapper – it’s exactly 
what I am doing…ping me off list if you need some ideas), put a regular index 
on it and run some queries.  If you have a decent size dev server to work on, 
you should see adequate performance.  And, what you’ll be incredibly pleased 
with is the remarkably rich and flexible plpgsql coding environment.  It’s 
heaven for sql devs.

 

BTW, I am putting together a rather detailed examination of Postgres ETL/bulk 
loading performance, on 2 different platforms: 4 cores/16gb ram/3tb SSD on AWS 
(Amazon cloud), and 48 cores/256gb ram/ 10tb SSD on a monster loaner Cisco UCS 
server.  Should have that posted to the Perform list later this week.

 

Mike Sofen (USA)

 



Re: [GENERAL] Stored procedure version control

2016-06-30 Thread Mike Sofen
-Original Message-
>From: Mark Morgan Lloyd Sent: Thursday, June 30, 2016 2:41 AM
>Neil Anderson wrote:
>> On 2016-06-29 12:37 PM, Mark Morgan Lloyd wrote:
>>> Elsewhere, somebody was asking how people implemented version control 
>>> for stored procedures on (MS) SQL Server.
>>>
>>> The consensus was that this is probably best managed by using scripts 
>>> or command files to generate stored procedures etc., but does anybody 
>>> have any comment on that from the POV of PostgreSQL?
>>>

> [etc.] Thanks everybody, summary passed on.
>Mark Morgan Lloyd
>markMLl .AT. telemetry.co .DOT. uk

A bit late to the thread, but here's some specific details on how I've
implemented version control in PG 9.5, in a small team environment deploying
to single database servers in each tier (dev, qa, stage, prod).  It's
working well so far, and allows my stored proc versions to be aligned with
the middle and upper tier code releases.  I'm the lead database
architect-engineer for a brand new genomics application (lots of data).
Details:
 - we're using git for version control, with a base name for each repo that
holds a single micro-service (like "JobManager") and a suffix for the data
tier code ("JobManagerDBMS") making it simple for devops to find the related
code for a micro-service deployment by repo.
 - within a DBMS repo, I've got subfolders like "scripts", "sprocs",
"documentation", where scripts holds ad hoc scripts that need to be run
during a deployment (might be adjusting DDL or seeding or cleaning up data),
sprocs for stored function files that must be compiled into PG, and
documentation holds notes, data models, etc.  We have a simple python script
that compiles/recompiles all stored proc files within a named folder -
deployment done with one call.
 - I only code using source code files, by cloning an existing suitable base
stored proc (akin to a template) to a new file name (like a "get" or "set"
stored proc) and then revising to match the requirement.  In a detailed
comment block within each stored proc, I list a version number (just for
reference, not used programmatically at this point), change history, author,
comments, and one or more sample calls that form the basis of my unit tests (a
sketch of such a header block follows after this list).
 - after I've finished the requested work and the stored procs are working
as expected, I update a Version file in the folder, push it into git and
merge it into the development branch.
 - let's say a new enhancement request comes in.  I create a new branch
(like "RequestForNewThingy"), version that, do the work and merge it back in
just like the above.  So we've got isolation and persistence of changes.
 - I happen to be using the new DataGrip code editor, which supports this
beautifully, since my git tree appears on the right side of editor window,
allowing me directly edit/clone without leaving the editor.
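
A hypothetical sketch of such a header block, at the top of each stored
function file:

CREATE OR REPLACE FUNCTION api.get_job_status(p_job_id bigint)
RETURNS TABLE (job_id bigint, status text, updated_at timestamptz)
LANGUAGE sql STABLE AS $$
/*------------------------------------------------------------------
  Version : 1.3  (reference only - not used programmatically)
  Author  : <author>
  History : 1.0  initial version
            1.3  added updated_at to the result set
  Sample  : SELECT * FROM api.get_job_status(42);  -- basis of the unit test
------------------------------------------------------------------*/
    SELECT j.job_id, j.status, j.updated_at
    FROM jobs.job j
    WHERE j.job_id = p_job_id;
$$;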

My coding efficiency using this model is quite high...the overhead of using
git is trivial.
For rollbacks, we can simply point to the prior stored proc version and
recompile those.  For DDL rollbacks, I have to code those scripts and supply
them...this is the one place I have to spend a bit more time creating a more
automated solution.  I'd love to hear how other folks have solved
programmatic rollbacks.

Mike Sofen  (San Diego, CA USA)







Re: [GENERAL] OT hardware recommend

2016-06-18 Thread Mike Sofen
From: Adam Brusselback
Sent: Saturday, June 18, 2016 1:26 PM
Subject: Re: [GENERAL] OT hardware recommend

 

It's really amazing how much solid state drives transferred the database 
bottleneck away from disk.

 

 

Adam – so very true.  We used to spend ungodly amounts of time/money/effort to 
mitigate disk performance limitations.  It is almost MAGIC what SSDs do now.

 

Real numbers don’t lie: 5400 rpm disks can muster no more than 65 IOPS (7200s 
get ~90-100, 10k get 140-150).  So:

 

15 x 65 = 975 IOPS (aka boohoo)

 

Using the AS SSD Benchmark, the Samsung 480gb m2 850 EVO in my core i7 laptop 
measures (IOPS then MB/s):

Random 4k blocks:  7,235 iops read, 14,012 iops write

Random 4K-64Threads:  97,743 iops read, 68,864 iops write

Random 512B:  14,380 iops read, 19,858 iops write (db comparison here)

MB/s:

Sequential: 500 MB/s read, 449 MB/s write

Random 4K:  28.26 MB/s read,  54.74 MB/s  write

4K-64Threads: 381.81 MB/s read, 269.00 MB/s write (this is closer to what db 
access looks like).

Access Times:  0.070 ms read, 0.050 ms write

 

Thusly,

1 x SSD = 14.75 times faster than a 15 drive array on reads, and 20 times 
faster on writes.

 

Like everyone else has said, just buy a 1 TB Samsung EVO 850 for $300 (USD) and 
call it a day.  :)

 

Mike

 



Re: [GENERAL] PostgresSQL and HIPAA compliance

2016-06-17 Thread Mike Sofen
-Original Message-
From: Alex John
Sent: Friday, June 17, 2016 3:04 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] PostgresSQL and HIPAA compliance

Hello, I have a few questions regarding the use of PostgreSQL and HIPAA
compliance. I work for a company that plans on storing protected health
information (PHI) on our servers. We have looked at various solutions for
doing so, and RDS is a prime candidate except for the fact that they have
explicitly stated that the Postgres engine is *not* HIPAA compliant.

Users on the IRC channel generally say that the guidelines are more catered
towards building better firewalls and a sane access policy, but I would like
to know if there is anything within the implementation of Postgres itself
that violates said compliance.

If anyone works at a similar company and utilizes postgresql to store PHI,
please let me know.

Thank you,
  Alex

-

HIPAA compliance does not specify (ever) the technical solution to meet the
requirements, so ANY datastore that can be properly managed within the
context of HIPAA compliance is legal and allowed.  Ignore IRCs and search on
securing PHI on relational databases, you'll find lots of details around
data access roles, documentation, processes, data obfuscation, etc.

Mike








Re: [GENERAL] regarding schema only migration from sqlserver to postgres with runmtk.sh

2016-06-15 Thread Mike Sofen
 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of David G. Johnston
Sent: Wednesday, June 15, 2016 1:31 PM
To: Durgamahesh Manne 
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] regarding schema only migration from sqlserver to 
postgres with runmtk.sh

 

On Wed, Jun 15, 2016 at 11:27 AM, Durgamahesh Manne  > wrote:

may i know  recommended approach to migrate the tables from sql server to 
postgresql instead of using 3rd party tools

 

 

​I suppose SQL Server has some means to export in TSV/CSV or other row-oriented 
structured output.  The \copy psql meta-command or COPY SQL ​command could load 
that data.

 

David J.

 

 

There are MANY tools for sql server, many of them built in (aka free), that will 
generate DDL, or extract data, or both; plus you can directly query the system 
catalogs, just like in PG.  Utterly trivial stuff to use, and Postgres can ingest 
all of it since it’s normally either flat file or ansi sql.
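
For example, once SQL Server has exported a table to CSV, the Postgres load
side is a one-liner (table and path names are hypothetical):

-- Server-side load (the file must be readable by the postgres server process):
COPY staging.customers FROM '/data/export/customers.csv' WITH (FORMAT csv, HEADER);

-- Or client-side from psql, which streams the file over the connection:
-- \copy staging.customers FROM 'customers.csv' WITH (FORMAT csv, HEADER)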

 

MikeS 



Re: [GENERAL] Alternate or Optimization for with hold cursor

2016-05-12 Thread Mike Sofen
>From: Sangeetha Sent: Thursday, May 12, 2016 1:58 AM

Currently , I am using "With hold" cursor. In our case , the With hold
cursor is used to fetch the next record of the given primary key . The
performance is very slow for large data set. Can you provide me some
alternative ways like having own copy of table , or optimization for With
hold cursor?

Thanks and Regards,
S.Sangeetha<
==

Cursors are the last tool I would ever grab out of my sql toolbox (aka, I
never use one) - it converts the enormous power of a relational database
engine into "RBAR" (row by agonizing row).  For a large dataset in
particular, you are killing the server since the entire resultset must be
retained in working memory for the duration of the query as it peels off one
row at a time from that resultset OR if it's larger than your ram, you'll be
paging to disk constantly.  And since you're working on a single row at a
time, it will take forever.

Convert the cursor into a normal query and you should see BIG (10-1000x)
gains in speed.  A cursor can always be converted to normal
sql...always...it's not always easy but it's always worth the effort. 
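
As an illustration (hypothetical names), the "fetch the next record for a given
primary key" case can be a single indexed query per request instead of a held
cursor:

SELECT *
FROM orders
WHERE order_id > $1    -- the key the caller already has
ORDER BY order_id
LIMIT 1;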
 
Mike Sofen





Re: [GENERAL] Thoughts on "Love Your Database"

2016-05-04 Thread Mike Sofen
From: dandl  Sent: Wednesday, May 04, 2016 5:05 PM
To: 'Pierre Chevalier Géologue' <pierrechevalierg...@free.fr>



> From: Pierre Chevalier Géologue [mailto:pierrechevalierg...@free.fr]
> ...
> > Then I think you've seriously misunderstood. Most people can indeed
> > learn to write basic SQL queries, but those are
> > (obviously) not what I'm talking about.
> >
> > To write the business logic of a significant application entirely in
> > SQL requires PLSQL (or in other dialects, whatever passes for SQL/PSM).
> > It means writing an entire data access layer as a set of stored
> > procedures, with a substantial set of special functions, types,
> > triggers and so on. No beginner and few experts have the skills
> > required to do that in SQL, and then debug that code on the server.
>
> All right, I understand better now.  I think I also totally missed
> your point, sorry...
> I'll give a look at andl.

 

I hope you do. Please feel free to contact me with any comments, suggestions, 
etc.

 

I have not completed the Postgres implementation -- probably another couple of 
weeks – 

but in-memory and Sqlite are there.

 

Bonne chance!

 

Regards

David M Bennett FACS

===

 

I disagree.  I’ve worked as a database architect/engineer at a number of large 
and small firms in various verticals (healthcare, financials, insurance, 
aerospace, telecom, etc), and created complete database api’s via stored 
procs/stored functions, some of which were quite complex.  I’ve found that a 
mid-level database developer, with modest coaching and good comments in the 
code, can pick up the code, support it and even enhance it.  So the notion that 
only experts can write and maintain quality code isn’t valid in my experience.

 

There is definitely a difference in capability/velocity/solution  solving 
between junior, mid-level and senior developers, but that isn’t a deal killer, 
it’s just something that needs to be managed and accounted for.  

 

One reason for a database api is that ORMs have proved themselves incapable of 
proper scaling and ACID compliance, where stored procs/functions are capable of 
leveraging the massive set-based relational power of the underlying engine, and 
leverage efficient functionalities like windowing functions.

 

So I guess you’d say I’m in the entirely opposite camp, since it’s proven to be 
such an effective solution architecture for many applications that leverage 
relational database engines.

 

Mike Sofen  (San Diego, CA  USA)



Re: [GENERAL] Thoughts on "Love Your Database"

2016-05-04 Thread Mike Sofen
>From: Vincent Veyron   Sent: Wednesday, May 04, 2016 3:35 PM

>I write management applications for businesses, and give _a lot_ of care to
>the database structure.
>I find that the number of lines of code that need to be written is strictly
>inversely correlated to the appropriateness of the database design (meaning
>that the better the database structure, the lesser code is needed).
>
>Knowing about the many fine functions Postgresql offers also helps, of course.
>   Bien à vous, Vincent Veyron


An inverse corollary is also true, imo: encountering demands for exotic,
wild functions, cursors and code constructs (like triggers calling triggers
or frankly, triggers in general), is nearly always an indication of poor
database design.  I'm not talking about windowing functions or json or CTEs,
btw.

Postgres and mysql have piles and piles of functions that I will never use
and can't even imagine scenarios in which to use them.  So I agree 100% -
it's all about the database (design).   BTW, I'm currently
designing/building OLTP databases for use in genomics research (using
Postgres)...that's big data...where there is zero tolerance for slack db
design that could cause scalability or performance issues.  My stored
functions are...relatively simple.

Mike Sofen  (San Diego, CA USA)






Re: [GENERAL] Function PostgreSQL 9.2

2016-05-03 Thread Mike Sofen
From: David G. Johnston  Sent: Tuesday, May 03, 2016 2:46 PM
To: drum.lu...@gmail.com



…The only other reasonable option is change your model and requirements to 
something less complex.

 

 

Seriously, get yourself the books I have recommended and study them  BEFORE you 
continue attempting to design your database.

You need a firm understanding of logical design & flow, otherwise you will be 
wasting your time. 

That's what I'm doing.. Studying.. asking for some help to get a better 
understand isn't this the purpose of this mail list?

 

​The purpose of this list is whatever people are willing to make of it - within 
reason.  I share the sentiment that your particular method of education is 
becoming burdensome to the people who volunteer their time on these lists to 
answer questions.  As the alternative is to simply stop replying to your emails 
be grateful that someone was at least willing to tell you to try other avenues 
of education.

 

David J.

 

+1   +1   +1   +1:  Lucas – David really nails it: you’ve stopped relying on 
yourself and are relying on the mercy and generosity of strangers to solve your 
problems.  It takes a lot of time to read, dissect, understand and then comment 
on your (or anyone’s) emails, and you’ve proven adept at consuming much of the 
available oxygen on the list.  Please be a bit more considerate - you’ve received 
an amazing amount of solid guidance and advice.  I can tell you that when I see 
your name as the author, I now normally delete the email – that’s NOT the 
response you want, right?

 

I’ve worked with a bunch of junior developers over the years, some of whom hit 
a wall and just sit there, waiting for someone to fix them.  Those junior 
developers “never” become senior developers because their peers soon lose 
interest in collaborating with them, if you catch my drift…

 

Mike Sofen



Re: [GENERAL] (VERY) Slow Query - PostgreSQL 9.2

2016-05-03 Thread Mike Sofen
 

From: drum.lu...@gmail.com   Sent: Tuesday, May 03, 2016 2:55 AM
I'm trying to get the query below a better performance.. but just don't know 
what else I can do...

Please, have a look and let me know if you can help somehow.. also.. if you 
need some extra data jet ask me please.

* Note that the gorfs.inode_segments table is 1.7TB size

I have the following Query:

explain analyze
SELECT split_part(full_path, '/', 4)::INT AS account_id,
   split_part(full_path, '/', 6)::INT AS note_id,
   split_part(full_path, '/', 9)::TEXT AS variation,
   st_size,
   segment_index,
   reverse(split_part(reverse(full_path), '/', 1)) as file_name,
   i.st_ino,
   full_path,
   (i.st_size / 100::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
FROM gorfs.inodes i
JOIN gorfs.inode_segments s
  ON i.st_ino = s.st_ino_target
WHERE
  i.checksum_md5 IS NOT NULL
  AND s.full_path ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+'
  AND i.st_size > 0
  AND split_part(s.full_path, '/', 4)::INT IN (
    SELECT account.id
    FROM public.ja_clients AS account
    WHERE NOT (
        ((account.last_sub_pay > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Months' AS INTERVAL)))) AND (account.price_model > 0)) OR
        (account.regdate > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('3 Month' AS INTERVAL)))) OR
        (((account.price_model = 0) AND (account.jobcredits > 0)) AND (account.last_login > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Month' AS INTERVAL)))))
    ) LIMIT 100
  );

 

There is one obvious solution: restructure your data, since it is not in a 
“standard” form but you’re trying to query it as if it were…you are turning 
your long full_path string into columns…if performance is a concern, that 
overhead has to be eliminated.  

Your two choices would be to either restructure this table directly (requiring 
a change in app code that was filling it), or use it to fill a proper table 
that already has everything decomposed from the long full_path string via 
post-processing after the insert.  A third consideration would be to archive 
off older/unneeded rows to a history table to reduce row counts.  This is about 
proper structure.
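
A rough sketch of that second option (the names and keys here are assumptions):
decompose full_path once, at write time, into an indexed companion table so the
report query filters on plain columns instead of split_part()/regex.

CREATE TABLE gorfs.inode_segments_parsed (
    st_ino_target bigint NOT NULL,
    segment_index int    NOT NULL,
    account_id    int    NOT NULL,
    note_id       int    NOT NULL,
    variation     text,
    file_name     text,
    full_path     text   NOT NULL,
    PRIMARY KEY (st_ino_target, segment_index)
);
CREATE INDEX ON gorfs.inode_segments_parsed (account_id);

-- Populated by post-processing (or a trigger) after each insert, e.g.:
INSERT INTO gorfs.inode_segments_parsed
SELECT s.st_ino_target,
       s.segment_index,
       split_part(s.full_path, '/', 4)::int,
       split_part(s.full_path, '/', 6)::int,
       split_part(s.full_path, '/', 9),
       reverse(split_part(reverse(s.full_path), '/', 1)),
       s.full_path
FROM gorfs.inode_segments s;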

Mike Sofen



Re: [GENERAL] truncate table getting blocked

2016-04-26 Thread Mike Sofen
From: Jayadevan M  Sent: Tuesday, April 26, 2016 6:32 AM



Hello,

 

I have a python script. It opens a cursor…

 

Thanks,

Jayadevan

 

 

 



Re: [GENERAL] Function PostgreSQL 9.2

2016-04-19 Thread Mike Sofen
This is such a poorly designed, hacked together “thing” – it isn’t a database, 
it’s someone’s idea of how to store data when they don’t know how to store 
data, like they moved it from Access or Excel.  

 

Just start over and design a proper relational schema with best practices and 
you’ll save, oh, perhaps 10 years of wasted effort and 12 million emails.  This 
is as close to bandaids on bandaids on steroids that it comes.  Really – 
rethink your solution model.

 

Mike

 

From: drum.lu...@gmail.com  Sent: Tuesday, April 19, 2016 7:40 PM
Just forgot to say:


Does increment_client_code relate to users or some other table, say clients?

nope.. there is no link between them

If the users.code is empty/null, then the trigger has to get the last number 
from client_code_increment and put on the users.code column



Re: [GENERAL] Re: Why is the comparison between timestamp and date so much slower then between two dates

2016-04-14 Thread Mike Sofen
|-Original Message-
|From: Thomas Kellerer   Sent: Wednesday, April 13, 2016 11:37 PM
|
|Alban Hertroys schrieb am 13.04.2016 um 16:39:
|>>> So my question is: why is comparing a timestamp to a date so much slower?
|>
|> The reason that the other way around is so much more expensive is that
|> the database needs to do that conversion twice for every row in the
|> table. When down-converting now(), the DB only needs to do that once
|> for all rows.
|
|Why does it do that for each row? The value of now() won't change while the
|statement is running, so this conversion could be done once at the start of the
|statement.

The general rule in the SQL Server world is that wrapping an indexed column in a 
function in a Where clause or join will eliminate usage of the index that would 
otherwise have been leveraged.  The reason is that the predicate is no longer 
sargable - the optimizer can't tell in advance what the function's output will 
look like for each row, so it takes the safest route to completion and scans.  
I'm betting that much the same logic holds in PG (I just haven't tested it 
enough to be absolutely sure).

In the case of now() in the Where clause, to avoid the conversion/loss of index 
usage, I always place (what should be a static value anyway) the output of 
now() into a local variable and then use that in the Where clause...and get my 
index back.
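
A small sketch of that style (hypothetical table and column names):

DO $$
DECLARE
    v_now timestamptz := now();   -- captured once, used as a plain constant below
BEGIN
    PERFORM count(*)
    FROM event_log e
    WHERE e.created_at >= v_now - interval '1 day';   -- the indexed column stays bare
END;
$$;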

This is just a style of coding (no functions in where clauses/joins), but one 
that doesn't seem prevalent in PG...instead I see people using functions within 
functions within functions, the cascading impact of which becomes very hard to 
unravel.

Mike Sofen





Re: [GENERAL] I/O - Increase RAM

2016-04-13 Thread Mike Sofen
|From: John R Pierce  Sent: Wednesday, April 13, 2016 1:53 PM
|
|On 4/13/2016 1:43 PM, drum.lu...@gmail.com wrote:
|> At the moment I'm having 100% I/O during the day. My server has SATA
|> HDs, and it can't be changed now.
|> So, to solve the problem (or at least try) I was thinking about double
|> the RAM, and by doing that, increasing the cache.
|
|depends on if its read or write IO.   many of our database servers are
|nearly 100% write IO, only thing that will speed that up is faster disks and/or
|more disks in raid10.
|--
|john r pierce, recycling bits in santa cruz

Agree with John and also add that if your READ queries or data model are not 
optimized then you could throw a TB of ram at it and see no difference.  You 
need to analyze your queries and find out WHY there's so much i/o.  It sounds 
like thrash to me...so you need to do some homework and get the behaviors 
sorted out, then it should become obvious what needs fixing.

Mike





Re: [GENERAL] How to delete few elements from array beginning?

2016-03-09 Thread Mike Sofen
>>Alexander Farber wrote on  Wednesday, March 09, 2016 4:11 AM



Hello fellow PostgreSQL users,

what is please the most efficient way to delete a slice from the start of a 
longer array (after I have copied it to another array)?

Do I really have to copy a large slice of the array to itself, like in the last 
line here:

pile_array := pile_array || swap_array;

/* here I copy away swap_len elements */
new_hand := pile_array[1:swap_len];

/* here I don't know how to efficiently remove already copied elements */
pile_array := pile_array[(swap_len + 1):ARRAY_LENGTH(pile_array, 1)];

or is there a better way?

Thank you

Alex

<< 

Have you considered a normal (relational), non-array-based data model for this 
app (2 or 3 tables in a 1:M/M:M) instead of the single table model you’ve 
shown?  That would then allow you to use normal sql set-based operations that 
are readable, understandable, maintainable and very fast/scalable.  

When I see row by row operations (looping or cursors) in what should be a real 
time query…that’s my alarm bell that perhaps the code has wandered off a valid 
solution path.
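
A sketch of what that relational shape might look like here (all names are
invented, and the $n placeholders stand for function or prepared-statement
parameters): the pile becomes rows with a draw position, and "take the first N
tiles" becomes one set-based statement.

CREATE TABLE game_pile (
    game_id int  NOT NULL,
    pos     int  NOT NULL,    -- draw order within the pile
    tile    text NOT NULL,
    PRIMARY KEY (game_id, pos)
);

CREATE TABLE player_hand (
    game_id   int  NOT NULL,
    player_id int  NOT NULL,
    tile      text NOT NULL
);

-- Move the first N tiles of a game's pile into a player's hand, atomically:
WITH drawn AS (
    DELETE FROM game_pile
    WHERE (game_id, pos) IN (
        SELECT game_id, pos
        FROM game_pile
        WHERE game_id = $1
        ORDER BY pos
        LIMIT $2
    )
    RETURNING game_id, tile
)
INSERT INTO player_hand (game_id, player_id, tile)
SELECT game_id, $3, tile FROM drawn;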

Mike