Re: [GENERAL] journaling / time travel

2016-09-24 Thread George Neuner
Coming late to this, but ...

On Mon, 19 Sep 2016 17:48:20 +0200, Willy-Bas Loos
 wrote:

>The use case of legal disputes being fought with our data as evidence and
>digging up the exact data from a certain point of time never occurred in
>those 10 years, and it is unlikely that it ever will.
>But it might, if anyone could reasonably expect this to be possible.
>
>:
>
>My question to you all is:
>* Is the legal thing actualy something one could expect of us?
>* Is the security thing really a good practice?
>* Is this a common use case that is normally solved with standard
>components?

I am not a lawyer.

You don't say where you are specifically, but in the US, there is a
legal notion that changes/deletions done in the "normal course of
business" generally are permitted.  That is, e.g., if once a month you
routinely purge records older than 3 years, then you can't be expected
to produce records from 4 years ago.  But you have to prove to the
court that this is normal for your business: e.g., show documentation
of your record keeping procedures.

The problem comes when you do get notice of a legal action.  From that
moment forward you must preserve any data that might be relevent to
the case ... including any new data that is created ... in the event
that it ever is subpoenaed by the court. 

This can become a major issue when you realize that a court case may
drag on for many years, and you may not know exactly what data has to
be preserved.  Lawyers often go on "fishing expeditions", asking for
data in many different ways [by different keywords, etc.], hoping to
find something by comparing the results.

Journaling solves the retention problem and may provide other nice
features like an audit trail of who made the changes.  Of course,
journaling may take a lot of extra space unless it stores only deltas.


Many locales have similar requirements for data preservation in the
face of a legal action.  You need to find out what is expected where
you are.  I'd have to advise that you talk to your lawyers rather than
ask here.

At least in the US, the "normal course of business" applies to archive
data as well as to live data, so you may be able to limit how long you
need to keep the journals.


Hope this ... doesn't further confuse.
George



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] journaling / time travel

2016-09-23 Thread phb07



On 09/19/2016 08:48 AM, Willy-Bas Loos wrote:
> Since records can
> be changed afterwards, it has been argued that we should have
> "journaling", meaning that every change to the data is saved in a
> separate schema that holds a "journaling" copy of each table

I don't think this is especially unusual. I've worked on similar
projects that maintain history for regulatory reasons. I can't speak to
your legal situation, but I don't think keeping history is a strange choice.

Re the security argument: it seems to me it depends on whether you can
restrict access to the journal while providing access to the normal
tables. I guess journaling would help for some threats but not others,
right? If regular users can't touch the journal, then I agree keeping
one could make sense.

Are you saying the journaling portion is 624GB on top of 1.1TB? Or that
of the 1.1TB, 624GB of it is from journaling? Either way it doesn't seem
like a massive cost to me.

I don't think PITR is an appropriate solution to keeping a 10-year
history of changes.

It sounds like you're not looking for a green-field solution, but just
trying to get perspective on what others are doing. Some resources for
this that might help you:

https://www.youtube.com/watch?v=TRgni5q0YM8
https://github.com/arkhipov/temporal_tables  (aka
http://pgxn.org/dist/temporal_tables/)
http://www.cs.arizona.edu/~rts/tdbbook.pdf 
  (also available in print)

https://www.amazon.com/Bitemporal-Data-Practice-Tom-Johnston/dp/0124080677

The two books contain multiple approaches to storing history, each with
tradeoffs for ease-of-use, disk space, etc. Reading them might be
overkill for you. If you want to choose one, I'd read Snodgrass. The
first few chapters are tedious, but it gets better.

Temporal databases are in a funny in-between zone where there is lots of
research, but standard tools are fairly underdeveloped. Postgres
recently added range types and exclusion constraints, which are
important primitives for building a temporal system, but it is still a
long way from SQL:2011, and SQL:2011 is itself a long way from
everything you might want. One thing that seems lacking to me, even in
the research, is how to handle DDL changes. You should be glad that you
only care about audit history and not subject history too, because going
bi-temporal is where you really cross over into lack of available tools,
outside of a few commercial offerings. (Teradata has temporal support,
using a Snodgrass-like approach that pre-dates the standard.)

Paul

You may also have a look on the E-Maj extension. It is available for 
instance here on pgxn .
Its purpose is to record all updates on tables either to examine them or 
to "rollback" them later. But it should be also usable for auditing.
The documentation in the package includes a detailed reference guide, 
but also a presentation that allows to quickly have a view on the extention.
Even if you think E-Maj doesn't fit your need, feel free to give me a 
feedback. It may give me thoughts for enhancements ;-)


Regards. Philippe.




Re: [GENERAL] journaling / time travel

2016-09-19 Thread Paul Jungwirth

> On 09/19/2016 10:56 AM, Willy-Bas Loos wrote:
> > On Mon, Sep 19, 2016 at 6:26 PM, Paul Jungwirth
> > > 
wrote:

> > I've worked on similar
> > projects that maintain history for regulatory reasons.
> Can you explain "regulatory reasons" please?

I mean government regulations, like HIPAA and Sarbanes-Oxley. I'm in the 
U.S. Since you are in the Netherlands you'll have different rules to 
deal with.


Paul



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] journaling / time travel

2016-09-19 Thread Paul Jungwirth

On 09/19/2016 08:48 AM, Willy-Bas Loos wrote:

Since records can
be changed afterwards, it has been argued that we should have
"journaling", meaning that every change to the data is saved in a
separate schema that holds a "journaling" copy of each table


I don't think this is especially unusual. I've worked on similar 
projects that maintain history for regulatory reasons. I can't speak to 
your legal situation, but I don't think keeping history is a strange choice.


Re the security argument: it seems to me it depends on whether you can 
restrict access to the journal while providing access to the normal 
tables. I guess journaling would help for some threats but not others, 
right? If regular users can't touch the journal, then I agree keeping 
one could make sense.


Are you saying the journaling portion is 624GB on top of 1.1TB? Or that 
of the 1.1TB, 624GB of it is from journaling? Either way it doesn't seem 
like a massive cost to me.


I don't think PITR is an appropriate solution to keeping a 10-year 
history of changes.


It sounds like you're not looking for a green-field solution, but just 
trying to get perspective on what others are doing. Some resources for 
this that might help you:


https://www.youtube.com/watch?v=TRgni5q0YM8
https://github.com/arkhipov/temporal_tables (aka 
http://pgxn.org/dist/temporal_tables/)

http://www.cs.arizona.edu/~rts/tdbbook.pdf (also available in print)
https://www.amazon.com/Bitemporal-Data-Practice-Tom-Johnston/dp/0124080677

The two books contain multiple approaches to storing history, each with 
tradeoffs for ease-of-use, disk space, etc. Reading them might be 
overkill for you. If you want to choose one, I'd read Snodgrass. The 
first few chapters are tedious, but it gets better.


Temporal databases are in a funny in-between zone where there is lots of 
research, but standard tools are fairly underdeveloped. Postgres 
recently added range types and exclusion constraints, which are 
important primitives for building a temporal system, but it is still a 
long way from SQL:2011, and SQL:2011 is itself a long way from 
everything you might want. One thing that seems lacking to me, even in 
the research, is how to handle DDL changes. You should be glad that you 
only care about audit history and not subject history too, because going 
bi-temporal is where you really cross over into lack of available tools, 
outside of a few commercial offerings. (Teradata has temporal support, 
using a Snodgrass-like approach that pre-dates the standard.)


Paul



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] journaling / time travel

2016-09-19 Thread Willy-Bas Loos
On Mon, Sep 19, 2016 at 5:48 PM, Willy-Bas Loos  wrote:

>
> The use case of legal disputes being fought with our data as evidence and
> digging up the exact data from a certain point of time never occurred in
> those 10 years, and it is unlikely that it ever will.
> But it might, if anyone could reasonably expect this to be possible.
>
>
Just to clarify, this is not a specific feature that we are bound to in an
SLA, or anything like that.
It is just argued that we should keep the feature because a judge may ask
for it.
I'm not asking for official legal advice on this list, just for common
sense.
We are based in the Netherlands and any legal action of relevance to this
question would take place here.

Cheers,
-- 
Willy-Bas Loos


[GENERAL] journaling / time travel

2016-09-19 Thread Willy-Bas Loos
Hi,

In a joint effort with some parties, we have created a large database (1.1
TB)  of records that change only occasionally. Mainly, more of them are
added to the database in a slow rate (1.1 TB built up in 10 years).
The records can have some significance in legal cases. Since records can be
changed afterwards, it has been argued that we should have "journaling",
meaning that every change to the data is saved in a separate schema that
holds a "journaling" copy of each table in the datamodel, including
lookups. So theoretically it is possible to go back to the data at one
exact point in time. This is stored in the same database.
By now, this journaling schema takes up 624GB.
I once looked into a contrib module called time travel that does something
similar, hence the title.

The use case of legal disputes being fought with our data as evidence and
digging up the exact data from a certain point of time never occurred in
those 10 years, and it is unlikely that it ever will.
But it might, if anyone could reasonably expect this to be possible.

Also, it has been argued that this journal is a good thing, because it
could save us in case of a unforeseen creeping corruption of data or some
other catastrophe that isn't covered  by our backup system. Someone
described this as a good practice, resulting in the fact that we still do
this.

Now, i don't like this blown up journal that we don't ever use, so i would
love to lose it.
But indeed we should do anything we reasonably can to protect that data.

My question to you all is:
* Is the legal thing actualy something one could expect of us?
* Is the security thing really a good practice?
* Is this a common use case that is normally solved with standard
components?

One thing: PITR would be a candidate for a standard component, but it takes
up much more space. We could double the retention time if the journal were
to be axed, but that would probably not be enough.

Cheers,

-- 
Willy-Bas Loos